Robust diagnosis of DLBCL from gene expression data from different laboratories

DIMACS - RUTCOR Workshop on Boolean and Pseudo-Boolean Functions in Memory of Peter L. Hammer, January 19-22, 2009

Authors: Peter L Hammer, Sorin Alexe, David E Axelrod, Gustavo Stolovitzky, Gyan Bhanot, Arnold J Levine, David Weissmann
Affiliations: IBM TJ Watson Research; Rutgers Univ; Institute for Advanced Study, Princeton; Cancer Institute of New Jersey

Overview
• Motivation
• Pattern-based ensemble classifiers
• Case study: compare data from two labs for DLBCL vs FL diagnosis
  Shipp et al. (2002) Nature Med. 8(1), 68-74. (Whitehead Lab)
  Stolovitzky G. (2005) In Deisboeck et al., Complex Systems Science in BioMedicine (in press) (preprint: http://www.wkap.nl/prod/a/Stolovitzky.pdf). (Dalla-Favera Lab)
  Alexe, Alexe, Axelrod, Hammer, Weissmann (2005) Artificial Intelligence in Medicine
  Bhanot, Alexe, Stolovitzky, Levine (2005) Genome Informatics

Non-Hodgkin lymphomas
FL
• low grade non-Hodgkin lymphoma / no cure if advanced stage
• second most frequent subtype of nodal lymphoid malignancies
• incidence has risen from 2-3 to more than 5-7 per 100,000/year ('50-'00)
• t(14;18) translocation: over-expression of anti-apoptotic bcl2
• 25-60% of FL cases evolve to DLBCL
DLBCL
• high grade non-Hodgkin lymphoma / high variability in response to treatment
• most frequent subtype of NHL
• < 2 years survival if untreated
Biomarkers: FL transformation to DLBCL
• p53/MDM2 (Moller et al., 1999)
• p16 (Pinyol, 1998)
• p38MAPK (Elenitoba-Johnson et al., 2003)
• c-myc (Lossos et al., 2002)

Gene arrays
• Gene arrays are a way to study the variation of mRNA levels between different types of cells.
• This allows diagnosis and inference of pathways that cause disease / early stage diagnosis
• Identify molecular profiles of disease: personalized medicine

Lymphoma datasets
Dataset  Source                                 Array          DLBCL  FL  Genes
WI       Shipp et al., 2002                     Affy HuGeneFL  58     19  6817
CU       Dalla-Favera Lab, Stolovitzky, 2005    Affy Hu95Av2   14     7   12581

Diagnosis problem
Input
• Training (biomedical) data: 2 classes, FL and DLBCL
• m samples described by N >> m features
Output
• Collection of robust biomarkers, models
• Robust, accurate classifier / tested on out-of-sample data

[Flow diagram of the method]
Data preprocessing: Input data; Creating training and test data; Normalization; Noise estimation
Robust feature selection: Filtering; Support set selection
Biology-based feature selection: Filtering; Support set selection
Classifier inputs shown: Raw data (training); Pattern data (training); Principal Components
INDIVIDUAL CLASSIFIERS: Artificial Neural Networks; Support Vector Machines; Weighted Voting System (LAD); k-Nearest Neighbors; Decision Trees (C4.5); Logistic Regression
Calibration
INTERMEDIATE CLASSIFIERS
META-CLASSIFIER: Classifier (Weighted Voting)
Validation (test data)

Patterns (Logical Analysis of Data, Hammer 1988)
• Positive Patterns
• Negative Patterns
Model
- Exhaustive collections of patterns
- Pattern space
- Classification / attribute analysis / new class identification

Data Preprocessing
• 50% P calls; UL = 16000, LL = 20
• Stratified 2/1 split of WI data into train/test
• CU data used for testing only
• Normalize data to median 1000 per array
• Generate 500 data sets using noise + k-fold stratified sampling + jackknife
• Find genes with high correlation to phenotype using t-test or SNR.
• Keep genes that are in > 90% of datasets

Choosing support sets
• Create quality patterns using small subsets of genes; validate using weighted voting with 10-fold cross validation
• Sort genes by their appearance in good patterns
• Select top genes to cover each sample by at least 10 patterns
Alexe, Alexe, Hammer, Vizvari (2005)

The 30 genes that best distinguish FL from DLBCL
Table columns: Gene symbol; Genes@Work; t-test; Shipp et al.; p53 regulated; Biological function
Gene symbols: TXNIP, DNASE1L3, CDH11, LUCA15, GPR18, CLU, LY9, RHOH, ELF2, CCNG2, CR2, CDKN2D, G18, LY86, ARPC1B, PPP2R5C, SEPP1, MCM7, BCL2A1, IMPDH2, RRP45, STAT1, DLG7, SLC1A5, TUBB2, PSMA6, PSMC1, LGALS3, CLTA, PAGA
Biological functions represented: metastases suppressor, apoptosis, oxidative stress, cell adhesion, signaling pathway, T-cell differentiation, transcription, cell cycle, signal transduction, cell growth, complement activation, cell motility, GMP biosynthesis, immune response, NF-kappaB cascade, cell-cell signaling, transport, microtubule movement, protein catabolism, spinocerebellar ataxia, sugar binding, cell proliferation

Genes identified by LAD (AIIM 2005) to distinguish DLBCL from FL

 #  Gene index  Accession #       Pearson corr.  Frequency  Gene description
 1  506         D55716_at          0.45          42.08      DNA REPLICATION LICENSING FACTOR CDC47 HOMOLOG
 2  1612        L42324_at         -0.49          30.00      (clone GPCR W) G protein-linked receptor gene (GPCR) gene, 5' end of cds
 3  972         HG4074-HT4344_at   0.45          23.33      Rad2
 4  2137        M63835_at          0.43          23.33      HIGH AFFINITY IMMUNOGLOBULIN GAMMA FC RECEPTOR I "A FORM" PRECURSOR
 5  605         D82348_at          0.53          22.50      5-aminoimidazole-4-carboxamide-1-beta-D-ribonucleotide transformylase/inosinicase
 6  6815        HG1980-HT2023_at   0.50           8.33      Tubulin, Beta 2
 7  7102        M94880_f_at       -0.43           8.33      HLA-A MHC class I protein HLA-A (HLA-A28,-B40, -Cw3)
 8  2988        U28386_at          0.48           7.08      RCH1 RAG (recombination activating gene) cohort 1
 9  4028        X02152_at          0.62           6.25      LDHA Lactate dehydrogenase A
10  4292        X56494_at          0.55           5.00      PKM2 Pyruvate kinase, muscle
11  4485        X69433_at          0.47           5.00      IDH2 Isocitrate dehydrogenase 2 (NADP+), mitochondrial
12  1430        L25876_at          0.44           4.17      Protein tyrosine phosphatase (CIP2) mRNA
13  1988        M35878_at         -0.28           4.17      INSULIN-LIKE GROWTH FACTOR BINDING PROTEIN 3 PRECURSOR
14  582         D79997_at          0.45           2.08      KIAA0175 gene
15  1092        J03909_at          0.53           2.08      GAMMA-INTERFERON-INDUCIBLE PROTEIN IP-30 PRECURSOR
16  2929        U23143_at          0.42           2.08      Mitochondrial serine hydroxymethyltransferase gene, nuclear encoded mitochondrion protein
17  3005        U29680_at          0.44           2.08      Bcl-2 related (Bfl-1) mRNA
18  4010        V00572_at          0.36           2.08      PGK1 Phosphoglycerate kinase 1
19  2789        U14518_at          0.51           0.00      CENPA Centromere protein A (17kD)
20  6703        X81836_s_at        0.37           0.00      Dent's disease candidate gene

Pearson corr.: Pearson correlation of genes in the support set with DLBCL vs FL outcome.
Frequency: frequency of participation in the definition of combinatorial biomarkers.
Functional gene group # (*) column (entries in table order; some rows have no entry): 1 2 1 2 4 2 1 6 6 6 5 2 3 5 6 5 -

Table 1. Selected non-minimal support set of 20 genes for distinguishing DLBCL from FL cases.
* 1: DNA replication, recombination and repair; 2: cell surface proteins and receptors; 3: protein synthesis and degradation; 4: structural proteins; 5: cell cycle and apoptosis; 6: metabolism; -: other.
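
The two numeric columns of Table 1 can be illustrated with a minimal sketch. It is not the authors' code: the function names `pearson` and `participation_frequency` are hypothetical, the expression values and gene sets are made up, the outcome is assumed to be coded 1 = DLBCL and 0 = FL, and the frequency is assumed to be the percentage of combinatorial biomarkers whose definition includes the gene.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def participation_frequency(gene, biomarkers):
    """Percentage of biomarkers (given as sets of genes) that use `gene`.

    Assumption: this is how the 'Frequency' column is defined.
    """
    hits = sum(1 for genes in biomarkers if gene in genes)
    return 100.0 * hits / len(biomarkers)


# Toy data, invented for illustration only.
outcome = [1, 1, 1, 0, 0]                # assumed coding: 1 = DLBCL, 0 = FL
expression = [2.1, 1.8, 2.5, 0.4, 0.9]   # one gene, five samples
biomarkers = [{"MCM7", "GPR18"}, {"MCM7"}, {"LDHA"}, {"MCM7", "LDHA"}]

r = pearson(expression, outcome)
freq = participation_frequency("MCM7", biomarkers)
print(f"correlation with outcome: {r:.2f}")
print(f"participation frequency: {freq:.2f}")
```

With this coding, a positive correlation means the gene tends to be more highly expressed in DLBCL, and a negative one (as for GPR18 in Table 1) means it tends to be more highly expressed in FL.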

Examples of FL and DLBCL patterns

          Training set prevalence (%)   Test set prevalence (%)
Pattern   Pos     Neg                   Pos     Neg
P1        97      0                     91      23
P2        95      0                     79      31
N1        0       100                   3       54

Gene symbols used in the patterns: GPR18, CLU, DLG7, MCM7
Pattern conditions listed: >-1.13, ≤0.91, >-0.26, >-0.62, >-0.77, ≤-0.55

WI training data:
• Each DLBCL case satisfies at least one of the patterns P1 and P2
• Each FL case satisfies the pattern N1 (and none of the patterns P1 and P2)

Pattern data
[Figure: positive and negative patterns across DLBCL and FL cases; panels: WI training data, WI test data, CU test data]

Meta-classifier performance

                           Training                          Test
Classifier        Weight   Sens. (%)  Spec. (%)  Error (%)   Sens. (%)  Spec. (%)  Error (%)
Trained on raw data
  ANN             0.08     94.74      92.31      5.88        82.35      84.62      17.02
  SVM             0.08     97.37      92.31      3.92        97.06      76.92      8.51
  kNN             0.09     97.37      100.00     1.96        91.18      84.62      10.64
  WV              0.07     92.11      92.31      7.84        94.12      76.92      10.64
  C4.5            0.06     94.74      84.62      7.84        94.12      69.23      12.77
  LR              0.07     97.37      84.62      5.88        94.12      69.23      12.77
Trained on pattern data
  ANN             0.10     100.00     100.00     0.00        97.06      76.92      8.51
  SVM             0.10     100.00     100.00     0.00        97.06      76.92      8.51
  kNN             0.10     100.00     100.00     0.00        100.00     69.23      8.51
  WV              0.10     100.00     100.00     0.00        97.06      76.92      8.51
  C4.5            0.10     100.00     100.00     0.00        91.18      76.92      12.77
  LR              0.05     100.00     76.92      5.88        100.00     61.54      10.64
Meta-classifier            100.00     100.00     0.00        100.00     76.92      6.38

Error distribution: raw and pattern data
[Figure: error distribution for the meta-classifier, classifiers trained on pattern data, and classifiers trained on raw data; WI test data and CU test data; axis 0 to 50]

Biology based method

FL → DLBCL progression
p53 related genes identified by filtering procedure
Gene symbols: CCNB1, MCM7, BRCA1, BCL2A1, PPP2R4, EIF2S2, COMT, IARS, MPI, ALAS1, MRPL3, NCF2, AARS, KIF11, CDK4, ATP1B1, CDC20, PRIM1, CDC2, TOP2A, CDK2, MYC, CCNE1, EPRS, PMAIP1, GSK3B, ACAA2, COL6A1, E2F5*, HRAS, POLA, SERPING1, HMGB2, CCNA2, PSMB5, CCT6A, ACTA2, PRKDC, INSR, CAD, SNRPA, TNFRSF1B, G1P2, ZNF184*, IMPDH1, ALDOA, MAP2K2, KARS, TOP2A, MAD2L1, CXCL1, GOT1, BAG1, CDC25B, TOP1, PSMA1, MAP4, KIAA0101, FDFT1, PCNA, MTA1, TCF3, CDKN1A, CYC1, HLAE*, UPP1, PLK1, TOPBP1, CDK7, E2F3, MDM4, AMPD2, RBBP4, CCNG2*, HARS, CASP6, RPS6KA1, GRP58, TP53, SMAD2, ATP5C1, TIMP3, THBS2, MYCBP, DTR, TIMP3, CBS, CDKN2D*, RELA

p53 pattern data
[Figure: positive and negative patterns across FL and DLBCL cases; panels: WI data, CU data]

Examples of p53 responsive genes patterns

                              Training set prevalence (%)   Test set prevalence (%)
Pattern   Conditions          Pos     Neg                   Pos     Neg
P1        >-0.66, >-0.89      93      11                    86      29
P2        >-0.66, >-0.78      90      11                    71      29
P3        >-0.8, >-0.33       69      11                    64      14
N1        ≤-0.66              3       74                    14      71
N2        ≤-0.56, ≤-0.18      3       68                    21      57
N3        ≤-0.11              3       68                    7       71

Gene symbols used in the patterns: CBS, E2F5, CDC2, KIAA0101, CCNE1, BCL2A1, CCNB1, MCM7
Additional condition listed: >0.11

WI data:
• Each DLBCL case satisfies one of the patterns P1, P2, P3
• Each FL case satisfies one of the patterns N1, N2, N3

p53 combinatorial biomarker
• 79% DLBCL & 23% FL cases (3.4 fold): at least two genes over-expressed
• 77% FL & 21% DLBCL cases (3.7 fold): at most one gene over-expressed
• Each individual gene: over-expressed in about 40-70% DLBCL & 20-40% FL (specificity 50-60%, sensitivity 60-70%)
[Figure: % cases (DLBCL, FL) vs. # of over-expressed genes in DLBCL vs. FL (p53, PLK1, CDK2), grouped as <= 1 and >= 2]

What are these genes?
Plk1 (stpk13): polo-like kinase, serine threonine protein kinase 13, M-phase specific
• cell transformation, neoplastic; drives quiescent cells into mitosis
• over-expressed in various human tumors
• Takai et al., Oncogene, 2005: plk1 potential target for cancer therapy, new prognostic marker for cancer
• Mito et al., Leuk Lymph, 2005: plk1 biomarker for DLBCL
Cdk2 (p33): cyclin-dependent kinase; G2/M transition of mitotic cell cycle, interacts with cyclins A, B3, D, E
P53: tumor suppressor gene (Levine 1982)

Conclusions
• Pattern-based meta-classifier is robust against noise
• Good prediction of FL → DLBCL
• Biology based analysis also possible
• Yields useful biomarker
• Should study biologically motivated sets of genes, build pathways

Thank you for your attention!