Logical Analysis of Data and Biomedical Applications

Logical Analysis of Diffuse Large B Cell Lymphoma

Gabriela Alexe (1), Sorin Alexe (1), David Axelrod (2), Peter Hammer (1), and David Weissmann (3)
(1) RUTCOR and (2) Department of Genetics, Rutgers University; (3) Robert Wood Johnson Medical School

This Talk
• Lymphoma
• Gene Expression Level Analysis
  • cDNA Microarray
  • Applied to Diffuse Large B-Cell Lymphoma
• Logical Analysis of Data
  • Discretization/Binarization
  • Support Sets
  • Pattern Generation
  • Theories and Models
  • Prediction

Lymphoma
• Cancer of lymphoid cells
  • Clonal
  • Uncontrolled growth
  • Metastasis
• Lymphoma
  • Diagnosis
  • Grade

Diffuse Large B Cell Lymphoma (DLBCL)
• 31% of non-Hodgkin lymphoma cases
• 50% long-term, disease-free survival
• Clinical variability
• Prognosis & therapy
  • IPI
  • Morphology
  • Gene expression

Diffuse Large B Cell Lymphoma
[Image]

Spleen with Diffuse Large B Cell Lymphoma
[Image]

Gene Expression Level Analysis

DNA-RNA Hybridization
[Figure]

Gene Expression Profiling
[Figure: tumor sample; standard cDNA microarray analysis]

DLBCL & cDNA Microarray Analysis
• "Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling," Alizadeh et al., Nature, vol. 403, pp. 503-511
• cDNA microarray data -> unsupervised hierarchical agglomerative clustering
• Germinal center signature: 76% survival at 5 years
• Activated B cell signature: 16% survival at 5 years

DLBCL Clustering
Each case (patient) is a point in N-dimensional space, where N = # of genes.
[Figure: germinal center genes; activated B cell genes]

DLBCL Survival by Type
[Figure]

Supervised Learning Classification of DLBCL
• "Diffuse large B-cell lymphoma prediction by gene-expression profiling and supervised machine learning," Shipp et al., Nature Medicine, vol. 8, pp. 68-74
• Prognosis of DLBCL
• Highly correlated genes -> weighted voting algorithm

Logical Analysis of Data

Logical Analysis of Data (LAD)
• Non-statistical method based on:
  • Combinatorics
  • Optimization
  • Logic
• Based on dataset of cases/patients
• LAD learns patterns characteristic of classes
  • Subsets of patients who are +/- for a condition
• Collections of patterns are extensible
• Predictions

The Problem: Approximation of Hidden Function
[Figure: Dataset; Hidden Function; LAD Approximation]

Main Components of LAD
• Discretization/Binarization
• Support Sets
• Pattern Generation
• Theories and Models
• Prediction

Discretization
[Figure: Separating Cutpoints; Minimum Set of Separating Cutpoints]

Cutpoints and Support Set
• Minimization is NP-hard
• Numerous powerful methods
• Support set:
  • Cutpoints define a grid in which ideally no cell contains both + and – cases
• Cutpoints simplify data and decrease noise

Patterns
• Examples:
  • Gene A > 34 & gene B < 24 & gene C < 2
• Positive and negative patterns
• Pattern parameters:
  • Degree (# of conditions)
  • Prevalence (# of +/- cases that satisfy it)
  • Homogeneity (proportion of +/- cases among those it covers)
• Best: low degree, large prevalence, high homogeneity
• Patterns are extensible!

Pattern Generation
• Generate patterns based on learning set
• Stipulate control parameters. For example:
  • Degree 4
  • + & - prevalences >= 70%
  • + & - homogeneities = 100%
• All 75 patterns in 1.2 seconds on a Pentium IV 1 GHz PC
• Evaluate set:
  • Average # of patterns covering each observation
  • Accuracy applied to evaluation set

Patterns: Illustration
[Figure: Positive Pattern; Negative Pattern]

Theories: Approximations of the 2 Regions
A theory is a set of positive (or negative) patterns such that every positive (or negative) case is covered.
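The pattern parameters and the definition of a theory given above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy cases, the gene names A, B, C, and the helper names `covers`, `pattern_stats`, and `is_theory` are all hypothetical, and prevalence is computed here as a fraction of the positive cases (the slides also express it as a percentage).

```python
import operator

# Comparison operators allowed in a pattern condition.
OPS = {">": operator.gt, "<": operator.lt}

def covers(pattern, case):
    """True if the case satisfies every condition (gene, op, cutpoint) of the pattern."""
    return all(OPS[op](case[gene], cut) for gene, op, cut in pattern)

def pattern_stats(pattern, positives, negatives):
    """Degree, prevalence, and homogeneity of a *positive* pattern."""
    pos_hit = sum(covers(pattern, c) for c in positives)
    neg_hit = sum(covers(pattern, c) for c in negatives)
    covered = pos_hit + neg_hit
    return {
        "degree": len(pattern),                      # number of conditions
        "prevalence": pos_hit / len(positives),      # share of positive cases covered
        "homogeneity": pos_hit / covered if covered else 0.0,  # share of covered cases that are positive
    }

def is_theory(patterns, cases):
    """A set of patterns is a theory for a class if every case of that class is covered."""
    return all(any(covers(p, c) for p in patterns) for c in cases)

# Invented toy data: expression levels of three hypothetical genes.
positives = [{"A": 40, "B": 10, "C": 1}, {"A": 50, "B": 20, "C": 0}, {"A": 10, "B": 5, "C": 9}]
negatives = [{"A": 20, "B": 30, "C": 5}, {"A": 36, "B": 30, "C": 1}]

# The example pattern from the Patterns slide, plus a second hypothetical pattern.
p1 = [("A", ">", 34), ("B", "<", 24), ("C", "<", 2)]
p2 = [("C", ">", 8)]

stats = pattern_stats(p1, positives, negatives)
covers_all = is_theory([p1, p2], positives)
covers_all_p1_only = is_theory([p1], positives)
```

On this toy data, `p1` has degree 3, covers two of the three positive cases, and covers no negative case, so its homogeneity is 1. The pattern `p1` alone is not a positive theory, but `p1` together with `p2` is, because every positive case is then covered by at least one pattern.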
[Figure: Positive Theory; Negative Theory]

Models
• A model consists of a positive theory and a negative theory
• A good model:
  • Small number of features (genes)
  • Patterns are high quality
    • Low degrees
    • High prevalences
    • High homogeneities
  • Number of patterns is small
  • Biologic interpretability is maximized

Theories and Models
[Figure: Positive Theory; Negative Theory; Unexplained Area. Model: Positive Area; Discordant Area; Negative Area]

LAD Prediction
• A new case: a set of gene expression levels
• Satisfies some positive & no negative patterns?
• Satisfies some negative & no positive patterns?
• Satisfies some of both?
  • Which more?
• Does not satisfy any (rare)

8 Gene Classification Model

Gene index   Accession #       Description
6642         U90543_at         Butyrophilin (BTF1)
6992         U46744_at         Dystrobrevin-alpha mRNA
3890         U87269_at         P120E4F transcription factor mRNA
5383         U12767_at         Mitogen induced nuclear orphan receptor (MINOR)
3674         U73167_cds5_at    Similar to human interferon-related protein SM15 (U09585)
2004         M37763_at         Neurotrophin-3 (NT-3) gene
1692         M12625_at         Lecithin-cholesterol acyltransferase
2280         M83651_at         Beta-1,4 N-acetylgalactosaminyltransferase

[Table: cutpoint conditions of each pattern on the eight genes, e.g. > 0.49; column alignment not recoverable]

Prevalence (%)
             Training set            Test set
Pattern   Positive   Negative    Positive   Negative
P1          72.22      0.00        62.50      30.00
P2          72.22      0.00        50.00      20.00
P3          72.22      0.00        62.50      10.00
P4          72.22      0.00        50.00      20.00
P5          61.11      0.00        62.50      20.00
P6          61.11      0.00        50.00      10.00
P7          55.56      0.00        25.00       0.00
P8          55.56      0.00        50.00      20.00
P9          55.56      0.00        50.00      30.00
N1           0.00     72.73        12.50      70.00
N2           0.00     68.18        12.50      50.00
N3           0.00     63.64        12.50      40.00
N4           0.00     63.64        50.00      70.00
N5           0.00     63.64         0.00      50.00
N6           0.00     59.09         0.00      40.00

Accuracy of Prognosis

                                  Training set                        Test set
Method                       Sensitivity (%)  Specificity (%)   Sensitivity (%)  Specificity (%)
Logistic Regression               88.9             90.9              50               60
Artificial Neural Networks        94.4             90.9              87.5             80
CART                              55.6             90.9              75               60
Fisher Discriminant               94.4             81.8              50               80
LAD                               100              100               87.5             90

Conclusion
• Logical Analysis of Data (LAD): a versatile new classification method, here applied to diagnosis and prognosis of lymphoma.
• LAD genes differ almost entirely from those specified by other studies.
• Genes are not individually correlated with diagnosis or prognosis, but are highly correlated in combinations of as few as two genes.
• Patterns suggest biologic pathways.
• LAD provides highly accurate prognosis of DLBCL.

Contacts
• Gabriela Alexe: galexe@us.ibm.com
• Sorin Alexe: salexe@rutcor.rutgers.edu
• David Axelrod: axelrod@biology.rutgers.edu
• Peter Hammer: hammer@rutcor.rutgers.edu
• David Weissmann: weissmdj@umdnj.edu
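The questions on the LAD Prediction slide can be read as a decision rule, sketched below in Python. This is only an illustration under stated assumptions, not the authors' classifier: the slide asks "Which more?" when a case satisfies patterns of both kinds, and this sketch assumes a simple count of satisfied patterns, with ties and uncovered cases left unclassified. The gene names g1 to g3, the cutpoints, and the function names are hypothetical.

```python
import operator

# Comparison operators allowed in a pattern condition.
OPS = {">": operator.gt, "<": operator.lt}

def covers(pattern, case):
    """True if the case satisfies every condition (gene, op, cutpoint) of the pattern."""
    return all(OPS[op](case[gene], cut) for gene, op, cut in pattern)

def predict(case, pos_patterns, neg_patterns):
    """Classify a new case by the patterns it satisfies.

    Assumption: when patterns of both kinds are satisfied, the larger
    count wins; a tie or no satisfied pattern gives "unclassified".
    """
    n_pos = sum(covers(p, case) for p in pos_patterns)
    n_neg = sum(covers(p, case) for p in neg_patterns)
    if n_pos > n_neg:
        return "positive"
    if n_neg > n_pos:
        return "negative"
    return "unclassified"

# Hypothetical model: two positive and two negative patterns.
pos_patterns = [[("g1", ">", 0.5)], [("g2", ">", 0.4), ("g3", "<", 0.3)]]
neg_patterns = [[("g1", "<", 0.3)], [("g2", "<", 0.2)]]

only_positive = predict({"g1": 0.6, "g2": 0.5, "g3": 0.1}, pos_patterns, neg_patterns)
only_negative = predict({"g1": 0.1, "g2": 0.1, "g3": 0.9}, pos_patterns, neg_patterns)
no_pattern = predict({"g1": 0.4, "g2": 0.3, "g3": 0.9}, pos_patterns, neg_patterns)
tie = predict({"g1": 0.6, "g2": 0.1, "g3": 0.1}, pos_patterns, neg_patterns)
```

A case that satisfies only positive patterns is predicted positive, and likewise for negative. Weighting the patterns, for example by prevalence, would be a natural alternative to plain counting.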
