Logical Analysis of Diffuse Large B Cell Lymphoma
Gabriela Alexe(1), Sorin Alexe(1), David Axelrod(2), Peter Hammer(1), and David Weissmann(3)
of RUTCOR(1) and Department of Genetics(2), Rutgers University; and Robert Wood Johnson Medical School(3)
R U T C O R

This Talk
• Lymphoma
• Gene Expression Level Analysis
  • cDNA Microarray
  • Applied to Diffuse Large B-Cell Lymphoma
• Logical Analysis of Data
  • Discretization/Binarization
  • Support Sets
  • Pattern Generation
  • Theories and Models
  • Prediction

Lymphoma
• Cancer of lymphoid cells
  • Clonal
  • Uncontrolled growth
  • Metastasis
• Lymphoma
  • Diagnosis
  • Grade

Diffuse Large B Cell Lymphoma (DLBCL)
• 31% of non-Hodgkin lymphoma cases
• 50% long-term, disease-free survival
• Clinical variability
• Prognosis & therapy
  • IPI
  • Morphology
  • Gene expression

Diffuse Large B Cell Lymphoma
[Figure: Diffuse Large B Cell Lymphoma]

Spleen with Diffuse Large B Cell Lymphoma
[Figure: Spleen with Diffuse Large B Cell Lymphoma]

Gene Expression Level Analysis

DNA-RNA Hybridization
[Figure: DNA-RNA Hybridization]

Gene Expression Profiling
[Figure: Tumor; Standard cDNA microarray analysis]

DLBCL & cDNA Microarray Analysis
• Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Alizadeh et al., Nature, vol 403, pp 503-511 (2000)
• cDNA microarray data -> unsupervised hierarchical agglomerative clustering
• Germinal center signature: 76% survival at 5 years
• Activated B cell signature: 16% survival at 5 years

DLBCL Clustering
Each case (patient) is a point in N-dimensional space, where N = # of genes
[Figure: Germinal center genes; Activated B cell genes]

DLBCL Survival by Type
[Figure: DLBCL Survival by Type]

Supervised Learning Classification of DLBCL
• Diffuse large B-cell lymphoma prediction by gene-expression profiling and supervised machine learning, Shipp et al., Nature Medicine, vol 8, pp 68-74 (2002)
• Prognosis of DLBCL
• Highly correlated genes -> weighted voting algorithm

Logical Analysis of Data

Logical Analysis of Data (LAD)
• Non-statistical method based on:
  • Combinatorics
  • Optimization
  • Logic
• Based on dataset of cases/patients
• LAD learns patterns characteristic of classes
  • Subsets of patients who are +/- for a condition
• Collections of patterns are extensible
• Predictions

The Problem: Approximation of Hidden Function
[Figure: Dataset; Hidden Function; LAD Approximation]

Main Components of LAD
• Discretization/Binarization
• Support Sets
• Pattern Generation
• Theories and Models
• Prediction

Discretization
[Figure: Separating Cutpoints; Minimum Set of Separating Cutpoints]

Cutpoints and Support Set
• Finding a minimum set of separating cutpoints is NP-hard
• Numerous powerful methods
• Support set:
  • Cutpoints define a grid in which ideally no cell contains both + and – cases
  • Cutpoints simplify data and decrease noise

Patterns
• Examples:
  • Gene A > 34 & gene B < 24 & gene C < 2
• Positive and negative patterns
• Pattern parameters:
  • Degree (# of conditions)
  • Prevalence (# of +/- cases that satisfy it)
  • Homogeneity (proportion of +/- cases among those it covers)
• Best: low degree, large prevalence, high homogeneity
• Patterns are extensible!

Pattern Generation
• Generate patterns based on learning set
• Stipulate control parameters. For example:
  • Degree 4
  • + & - prevalences >= 70%
  • + & - homogeneities = 100%
• All 75 patterns in 1.2 seconds on a Pentium IV 1 GHz PC
• Evaluate set:
  • Average # of patterns covering each observation
  • Accuracy applied to evaluation set

Patterns: Illustration
[Figure: Positive Pattern; Negative Pattern]

Theories: Approximations of the 2 Regions
A theory is a set of positive (or negative) patterns such that every positive (or negative) case is covered.
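The pattern parameters and the theory definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Pattern` class, the helper names, and the toy expression values are hypothetical and do not come from the talk, and prevalence is computed here as a fraction of same-class cases (the percentage form used with the control parameters) rather than as a raw count.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical representation: a condition is (gene, operator, cutpoint),
# with operator ">" or "<"; a case maps gene names to expression levels.
Condition = Tuple[str, str, float]
Case = Dict[str, float]


@dataclass(frozen=True)
class Pattern:
    conditions: Tuple[Condition, ...]

    @property
    def degree(self) -> int:
        # Degree = number of conditions in the conjunction.
        return len(self.conditions)

    def covers(self, case: Case) -> bool:
        # A case is covered only if it satisfies every condition.
        for gene, op, cut in self.conditions:
            value = case[gene]
            if op == ">" and not value > cut:
                return False
            if op == "<" and not value < cut:
                return False
        return True


def prevalence(pattern: Pattern, same_class: List[Case]) -> float:
    # Assumption: fraction of the pattern's own class that it covers.
    return sum(pattern.covers(c) for c in same_class) / len(same_class)


def homogeneity(pattern: Pattern, same_class: List[Case],
                other_class: List[Case]) -> float:
    # Share of covered cases that belong to the pattern's own class.
    own = sum(pattern.covers(c) for c in same_class)
    other = sum(pattern.covers(c) for c in other_class)
    return own / (own + other) if own + other else 0.0


def is_theory(patterns: List[Pattern], cases: List[Case]) -> bool:
    # A theory covers every case of its class with at least one pattern.
    return all(any(p.covers(c) for p in patterns) for c in cases)


# Toy data (invented for illustration only).
positives = [{"A": 40, "B": 10}, {"A": 36, "B": 30}, {"A": 50, "B": 5}]
negatives = [{"A": 20, "B": 10}, {"A": 38, "B": 40}]

p1 = Pattern((("A", ">", 34.0), ("B", "<", 24.0)))
p2 = Pattern((("A", ">", 34.0), ("B", "<", 35.0)))

print(p1.degree, prevalence(p1, positives), homogeneity(p1, positives, negatives))
print(is_theory([p1], positives), is_theory([p1, p2], positives))
```

In this toy setting `p1` alone leaves one positive case uncovered, so it is not a theory by itself; adding `p2` closes the gap.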
[Figure: Positive Theory; Negative Theory]

Models
• A model combines a positive theory and a negative theory
• A good model:
  • Small number of features (genes)
  • Patterns are high quality
    • Low degrees
    • High prevalences
    • High homogeneities
  • Number of patterns is small
  • Maximize their biologic interpretability

Theories and Models
[Figure: Positive Theory; Unexplained Area; Negative Theory; Model; Positive Area; Discordant Area; Negative Area]

LAD Prediction
• A new case: a set of gene expression levels
• Satisfy some positive & no negative?
• Satisfy some negative & no positive?
• Satisfy some of both?
  • Which more?
• Does not satisfy any (rare)

8 Gene Classification Model

Gene index: 6642, 6992, 3890, 5383, 3674, 2004, 1692, 2280

Accession #       Description
U90543_at         Butyrophilin (BTF1) mRNA
U46744_at         Dystrobrevin-alpha mRNA
U87269_at         P120E4F transcription factor mRNA
U12767_at         Mitogen induced nuclear orphan receptor (MINOR) mRNA
U73167_cds5_at    [description garbled in source]
M37763_at         Neurotrophin-3 (NT-3) gene
M12625_at         Lecithin-cholesterol acyltransferase mRNA
M83651_at         BETA-1,4 N-ACETYLGALACTOSAMINYLTRANSFERASE
Unplaced description fragments: human interferon-related protein SM15 mRNA (U09585); final exon with 5' and 3' partial flanking sequence; similar to DNA sequences of human

[Cutpoint conditions per pattern and gene (e.g. >0.49, >0.48, >0.46, >0.63): column alignment not recoverable]

Prevalence (%)
           Training set           Test set
Pattern    Positive   Negative    Positive   Negative
P1         72.22      0.00        62.50      30.00
P2         72.22      0.00        50.00      20.00
P3         72.22      0.00        62.50      10.00
P4         72.22      0.00        50.00      20.00
P5         61.11      0.00        62.50      20.00
P6         61.11      0.00        50.00      10.00
P7         55.56      0.00        25.00      0.00
P8         55.56      0.00        50.00      20.00
P9         55.56      0.00        50.00      30.00
N1         0.00       72.73       12.50      70.00
N2         0.00       68.18       12.50      50.00
N3         0.00       63.64       12.50      40.00
N4         0.00       63.64       50.00      70.00
N5         0.00       63.64       0.00       50.00
N6         0.00       59.09       0.00       40.00
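The decision questions on the LAD Prediction slide can be sketched as a small function. This is a hedged illustration, not the authors' implementation: the name `lad_predict`, the lambda patterns, and the cutpoints are hypothetical, and the "Which more?" tie-break is assumed here to compare the share of triggered positive patterns with the share of triggered negative patterns.

```python
from typing import Callable, Dict, Sequence

Case = Dict[str, float]
Rule = Callable[[Case], bool]  # a pattern as a yes/no test on a case


def _share(hits: int, total: int) -> float:
    # Fraction of a theory's patterns triggered by the case.
    return hits / total if total else 0.0


def lad_predict(case: Case,
                positive_patterns: Sequence[Rule],
                negative_patterns: Sequence[Rule]) -> str:
    pos_hits = sum(1 for p in positive_patterns if p(case))
    neg_hits = sum(1 for n in negative_patterns if n(case))

    # "Does not satisfy any (rare)": the model gives no answer.
    if pos_hits == 0 and neg_hits == 0:
        return "unclassified"

    # Covers the one-sided cases and, by assumption, "Which more?".
    pos_share = _share(pos_hits, len(positive_patterns))
    neg_share = _share(neg_hits, len(negative_patterns))
    if pos_share > neg_share:
        return "positive"
    if neg_share > pos_share:
        return "negative"
    return "unclassified"  # exact tie between the two theories


# Toy model (invented patterns and cutpoints, for illustration only).
positive_patterns = [
    lambda c: c["A"] > 34 and c["B"] < 24,
    lambda c: c["C"] < 2,
]
negative_patterns = [
    lambda c: c["A"] < 30,
    lambda c: c["B"] > 35,
    lambda c: c["C"] > 5,
]

for case in (
    {"A": 40, "B": 10, "C": 1},   # positive patterns only
    {"A": 20, "B": 40, "C": 6},   # negative patterns only
    {"A": 40, "B": 10, "C": 6},   # some of both
    {"A": 32, "B": 30, "C": 3},   # no pattern satisfied
):
    print(case, "->", lad_predict(case, positive_patterns, negative_patterns))
```

Comparing shares rather than raw counts keeps a theory with many patterns from outvoting a smaller one simply because of its size; other weightings are equally possible.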
Accuracy of Prognosis

                              Training set                        Test set
Method                        Sensitivity (%)   Specificity (%)   Sensitivity (%)   Specificity (%)
Logistic Regression           88.9              90.9              50                60
Artificial Neural Networks    94.4              90.9              87.5              80
CART                          55.6              90.9              75                60
Fisher Discriminant           94.4              81.8              50                80
LAD                           100               100               87.5              90

Conclusion
• Logical Analysis of Data (LAD): a versatile new classification method, here applied to diagnosis and prognosis of lymphoma.
• LAD genes differ almost entirely from those specified by other studies.
• Genes are not individually correlated with diagnosis or prognosis, but are highly correlated in combinations of as few as two genes.
• Patterns suggest biologic pathways.
• LAD provides highly accurate prognosis of DLBCL.

Contacts
• Gabriela Alexe: galexe@us.ibm.com
• Sorin Alexe: salexe@rutcor.rutgers.edu
• David Axelrod: axelrod@biology.rutgers.edu
• Peter Hammer: hammer@rutcor.rutgers.edu
• David Weissmann: weissmdj@umdnj.edu