Cancer Classification with Data-dependent Kernels
Anne Ya Zhang (with Xue-wen Chen & Huilin Xiong)
EECS & ITTC, University of Kansas
2016/6/30, DIMACS Workshop on Machine Learning Techniques in Bioinformatics

Outline
- Introduction
- Data-dependent Kernel
- Results
- Conclusion

Cancer facts
- Cancer is a group of many related diseases.
- Cancer is the second leading cause of death in the United States, causing 1 of every 4 deaths.
- The NIH estimated the overall costs of cancer in 2004 at $189.8 billion ($64.9 billion of it direct medical costs).

Cancer types
- Cancerous cells continue to grow and divide and do not die when they should, because of changes in the genes that control normal cell growth and death.
- Examples: breast cancer, lung cancer, colon cancer, ...
- Death rates vary greatly by cancer type and stage at diagnosis.

Motivation
Why do we need to classify cancers? The general way of treating cancer is to:
- categorize the cancers into different classes, and
- use a specific treatment for each of the classes.
Traditional ways to classify cancers:
- Morphological appearance: not accurate.
- Enzyme-based histochemical analyses, immunophenotyping, cytogenetic analysis: complicated, and they need highly specialized laboratories.

Motivation
Why are the traditional ways not enough?
- Some tumors in the same class have completely different clinical courses; a more accurate classification may be needed.
- Assigning new tumors to known cancer classes is not easy, e.g. assigning an acute leukemia tumor to either AML (acute myeloid leukemia) or ALL (acute lymphoblastic leukemia).

DNA Microarray-based Cancer Diagnosis
- Cancer is caused by changes in the genes that control normal cell growth and death.
- Molecular diagnostics offer the promise of precise, objective, and systematic cancer classification.
- These tests are not yet widely applied because characteristic molecular markers for most solid tumors have yet to be identified.
- Recently, microarray tumor gene expression profiles have been used for cancer diagnosis.

Microarray
- A microarray experiment monitors the expression levels of thousands of genes simultaneously.
- Microarray techniques will lead to a more complete understanding of the molecular variations among tumors, and hence to a more reliable classification.
[Figure: expression heat map (low / zero / high) for genes G1-G7 across conditions C1-C7.]

Microarray
- Microarray analysis allows the monitoring of the activities of thousands of genes over many different conditions.
- From a machine learning point of view, the data form a gene-by-experiment matrix (genes g-1, ..., g-n in rows; experiments ex-1, ..., ex-m in columns).
- The large volume of the data requires computational aid in analyzing the expression data.

Machine learning tasks in cancer classification
There are three main types of machine learning problems associated with cancer classification:
- the identification of new cancer classes using gene expression profiles,
- the classification of cancers into known classes, and
- the identification of "marker" genes that characterize the different cancer classes.
In this presentation, we focus on the second type of problem.

Project Goals
- Develop a more systematic machine learning approach to cancer classification using microarray gene expression profiles.
- Use an initial collection of samples belonging to the known classes of cancer to create a "class predictor" for new, unknown samples.
Challenges in cancer classification
Gene expression data are typically characterized by:
- high dimensionality (i.e. a large number of genes), and
- small sample size.
This is the curse of dimensionality. Methods used here:
- kernel techniques,
- data resampling,
- gene selection.

Outline
- Introduction
- Data-dependent Kernel
- Results
- Conclusion

Data-dependent kernel model
The data-dependent kernel is

  k(x, y) = q(x)\, q(y)\, k_0(x, y),

where k_0(x, y) is a basic kernel function and

  q(x) = \alpha_0 + \sum_{i=1}^{m} \alpha_i\, k_1(x, x_i), \qquad k_1(x, x_i) = e^{-\gamma_1 \|x - x_i\|^2}.

Optimizing the data-dependent kernel means choosing the coefficient vector \alpha = (\alpha_0, \alpha_1, \ldots, \alpha_m)^T.

Optimizing the kernel
Criterion for kernel optimization: maximize the class separability of the training data in the kernel-induced feature space,

  \max J = \frac{\operatorname{tr}(S_b)}{\operatorname{tr}(S_w)},

where S_b is the between-class scatter matrix and S_w is the within-class scatter matrix.

The Kernel Optimization
In terms of \alpha, the criterion becomes

  \max J = \frac{\alpha^T M_0\, \alpha}{\alpha^T N_0\, \alpha},

where M_0 and N_0 are functions of the basic kernel matrices K_0 and K_1. If N_0 is nonsingular, the optimal \alpha is the eigenvector of N_0^{-1} M_0 corresponding to the largest eigenvalue. In reality, the matrix N_0 is usually singular, so we instead solve

  (N_0 + \varepsilon I)^{-1} M_0\, \alpha = \lambda\, \alpha

and take \alpha to be the eigenvector corresponding to the largest eigenvalue.

Kernel optimization
[Figure: training and test data before and after kernel optimization.]

Distributed resampling
Original training data: \{(x_i, y_i)\}, i = 1, 2, \ldots, m.
Training data with resampling: \{(a_i, b_i)\}, i = 1, 2, \ldots, 3m, where

  a_i = x_i,\; b_i = y_i \quad \text{for } 1 \le i \le m,
  a_i = x_r + \varepsilon,\; b_i = y_r \quad \text{for } i > m, \qquad \varepsilon \sim N(0, \sigma^2),

and x_r is a random sample of \{x_i\} drawn with replacement.
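The three ingredients above (the data-dependent kernel, the tr(S_b)/tr(S_w) criterion, and distributed resampling) can be sketched in numpy as follows. This is a minimal illustration on toy data, with helper names of my own choosing; the actual selection of α via the eigenproblem on (N_0 + εI)^{-1} M_0 from Xiong et al. is not reproduced here, and the sketch only evaluates J for a fixed α.

```python
import numpy as np

def k1(X, Xi, gamma1=1.0):
    """k1(x, x_i) = exp(-gamma1 * ||x - x_i||^2), evaluated pairwise."""
    d2 = ((X[:, None, :] - Xi[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma1 * d2)

def data_dependent_gram(X, alpha, gamma0=1.0, gamma1=1.0):
    """Gram matrix of k(x, y) = q(x) q(y) k0(x, y) on the training set,
    with a Gaussian basic kernel k0 and q(x) = alpha_0 + sum_i alpha_i k1(x, x_i)."""
    K0 = np.exp(-gamma0 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    q = alpha[0] + k1(X, X, gamma1) @ alpha[1:]   # q(x_1), ..., q(x_m)
    return q[:, None] * K0 * q[None, :]           # elementwise D K0 D

def separability(K, y):
    """J = tr(S_b) / tr(S_w) in the kernel-induced feature space,
    computed directly from the (uncentered) Gram matrix K."""
    n = len(y)
    tr_St = np.trace(K) - K.sum() / n             # total scatter tr(S_t)
    tr_Sw = 0.0
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        Kc = K[np.ix_(idx, idx)]
        tr_Sw += np.trace(Kc) - Kc.sum() / len(idx)
    return (tr_St - tr_Sw) / tr_Sw                # tr(S_b) = tr(S_t) - tr(S_w)

def distributed_resample(X, y, sigma=0.3, rng=None):
    """Keep the m originals and add 2m noisy copies a_i = x_r + eps,
    with x_r drawn with replacement and eps ~ N(0, sigma^2 I)."""
    if rng is None:
        rng = np.random.default_rng()
    m = len(X)
    idx = rng.integers(0, m, size=2 * m)
    Xr = X[idx] + rng.normal(0.0, sigma, size=(2 * m, X.shape[1]))
    return np.vstack([X, Xr]), np.concatenate([y, y[idx]])

# Toy two-class data (illustrative only, not from the talk)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(2, 1, (20, 5))])
y = np.array([0] * 20 + [1] * 20)
alpha = np.ones(len(X) + 1) / len(X)              # an untuned coefficient vector
J = separability(data_dependent_gram(X, alpha), y)
```

Note that J can be computed from the Gram matrix alone: tr(S_t) and tr(S_w) reduce to traces and block sums of K, so no explicit feature-space coordinates are needed.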
Gene selection
A filter method based on class separability:

  g(j) = \frac{\sum_{k} m_k \, (\bar{x}_k(j) - \bar{x}(j))^2}{\sum_{k} \sum_{i \in C_k} (x_i(j) - \bar{x}_k(j))^2},

where C_k is the index set of the k-th class, m_k is the number of samples in C_k, \bar{x}_k(j) is the average expression of gene j across the k-th class, and \bar{x}(j) is the average expression of gene j over all training samples.

Outline
- Introduction
- Data-dependent Kernel
- Results
- Conclusion

Comparison with other methods
- k-Nearest Neighbor (kNN)
- Diagonal linear discriminant analysis (DLDA)
- Uncorrelated linear discriminant analysis (ULDA)
- Support vector machines (SVM)

Data sets
- Subtypes: ALL vs. AML
- Status of estrogen receptor
- Status of lymph nodes
- Outcome of treatment
- Tumor vs. healthy tissue
- Subtypes: MPM vs. ADCA
- Different lymphoma cells
- Cancer vs. non-cancer
- Tumor vs. healthy tissue

Experimental setup
- Data normalization: zero mean and unit variance in the gene direction.
- Randomly partition the data into two disjoint subsets of equal size: training data + test data.
- Repeat each experiment 100 times.

Parameters
- DLDA: no parameters.
- kNN: Euclidean distance, K = 3.
- ULDA: K = 3.
- SVM: Gaussian kernel; leave-one-out cross-validation on the training data is used to tune the parameters.
- KerNN: Gaussian kernel as the basic kernel k_0; \gamma_0 and \sigma are set empirically; leave-one-out cross-validation on the training data is used to tune the remaining parameters.
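The class-separability filter g(j) used for gene selection above can be sketched in numpy as follows; the helper name and the toy data are illustrative, not from the talk. Genes with a large ratio of between-class to within-class scatter rank highest.

```python
import numpy as np

def class_separability_scores(X, y):
    """g(j) = sum_k m_k (xbar_k(j) - xbar(j))^2
             / sum_k sum_{i in C_k} (x_i(j) - xbar_k(j))^2,
    computed for every gene j (columns of X)."""
    overall = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        between += len(Xc) * (mean_c - overall) ** 2
        within += ((Xc - mean_c) ** 2).sum(axis=0)
    return between / within

# Toy data: 30 samples, 6 genes; only gene 0 differs between the classes.
rng = np.random.default_rng(1)
y = np.array([0] * 15 + [1] * 15)
X = rng.normal(0.0, 1.0, (30, 6))
X[y == 1, 0] += 3.0                       # make gene 0 informative
g = class_separability_scores(X, y)
ranked = np.argsort(g)[::-1]              # gene indices, best first
```

As a filter method, the scores are computed once per gene independently of any classifier, so selecting the top-ranked genes is cheap even for thousands of genes.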
  For KerNN, kNN is then used for classification in the optimized kernel space.

Effect of data resampling
[Figure: results on Lung (181 samples) and Prostate (102 samples).]

Effect of gene selection
[Figures: results on ALL-AML, Colon, and Prostate.]

Comparison results
[Figures: results on ALL-AML, BreastLN, BreastER, Colon, CNS, Ovarian, Lung, and Prostate.]

Outline
- Introduction
- Data-dependent Kernel
- Results
- Conclusion

Conclusion
- By maximizing the class separability of the training data, the data-dependent kernel is also able to increase the separability of the test data.
- The kernel method is robust to high-dimensional microarray data.
- The distributed resampling strategy helps to alleviate overfitting.

Conclusion
- The classifier assigns samples more accurately than the other approaches, so each class can receive its appropriate treatment.
- The method can be used to clarify unusual cases, e.g. a patient who was diagnosed with AML but with atypical morphology.
- The method can be applied to distinctions relating to future clinical outcomes.

Future work
- How to estimate the parameters.
- Study the genes selected.

References
- H. Xiong, M.N.S. Swamy, and M.O. Ahmad.
Optimizing the data-dependent kernel in the empirical feature space. IEEE Trans. on Neural Networks 2005, 16:460-474.
- H. Xiong, Y. Zhang, and X. Chen. Data-dependent kernels for cancer classification. Under review.
- A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini. Tissue classification with gene expression profiles. J. Computational Biology 2000, 7:559-584.
- S. Dudoit, J. Fridlyand, and T.P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Statistical Assoc. 2002, 97:77-87.
- T.S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16:906-914.
- J. Ye, T. Li, T. Xiong, and R. Janardan. Using uncorrelated discriminant analysis for tissue classification with gene expression data. IEEE/ACM Trans. on Computational Biology and Bioinformatics 2004, 1:181-190.

Thanks! Questions?