1-D DISCRETE WAVELET TRANSFORM FOR CLASSIFICATION OF CANCER SAMPLES IN DNA MICROARRAY DATA

Adarsh Jose (1), Dale Mugler, PhD (1), Zhong-Hui Duan, PhD (2,3)
1. Department of Biomedical Engineering, The University of Akron
2. Department of Computer Science, The University of Akron
3. Integrated Biosciences Program, The University of Akron

Abstract
The main obstacle in applying supervised learning methods to classify cancer samples from gene expression profiles is the limited number of available samples. Selecting the relevant features is therefore imperative for optimizing the classification algorithms. A feature (gene) selection method using the 1-D discrete wavelet transform is proposed for addressing two-class problems in DNA microarray data.

Gene Expression
The process by which encoded information from DNA is converted into actual structures in cells. The subset of expressed genes and their expression levels form a characteristic of the state of the cell.

DNA Microarrays
• Allow measurement of the expression levels of thousands of genes simultaneously.
• The entire genome can be probed at a single point in time.
• Based on base-pair attraction between complementary pairs in DNA and RNA strands.
• Microarray technology quantifies the notion of gene expression.

Classification Problem of Microarray Data
1. Training sets with class labels
2. Feature selection
3. Training the classifier
4. Validation using a testing set
Problem: The number of features (genes) is very large compared to the number of samples.
Solution: Reduce the feature size by selecting or extracting the most relevant features.

What Is the Wavelet Transform?
1-D Discrete Wavelet Transform
• Breaks the signal down into different frequency bands.
• Implemented by sending the signal through a series of high-pass and low-pass half-band filters.

Wavelet Decomposition Examples
[Figure not reproduced.]

Datasets
• Leukemia dataset: 48 ALL & 25 AML samples
• B-cell lymphoma dataset: 58 DLBCL & 10 FCC samples

Procedure
• The samples are grouped into the 2 classes.
• The 1-D discrete wavelet transform of each gene's expression profile was taken to level 3.
• The gene expression profile was reconstructed using the level 3 approximation only.
• Score = abs(mean(class1) - mean(class2))
• Genes were ranked by their scores.

Results & Observations
• The algorithm was tested for classification accuracy on the oligonucleotide datasets using a KNN classifier and 3 different validation methods, for different numbers of selected variables.
• The 'Haar' and 'Bior1.5' wavelets gave accuracy of up to 97%.
• The average classification error is less than 11% in both of the oligonucleotide datasets studied.
• Shuffling the samples within each class does NOT have any effect on the accuracy.

Conclusions
• 1-D discrete wavelets can capture patterns in gene expression data, which makes them a potential tool for feature selection.
• A complete error estimation study has to be carried out with microarray data obtained from different platforms.

References
1. Golub TR, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, vol. 286 (1999). www.sciencemag.org
2. Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, Gaasenbeek M, Angelo M, Reich M, Pinkus GS, Ray TS, Koval MA, Last KW, Norton A, Lister TA, Mesirov J, Neuberg DS, Lander ES, Aster JC, Golub TR. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med 2002 Jan;8(1):68-74.
3. MATLAB manual: MATLAB Wavelet Toolbox, MATLAB Bioinformatics Toolbox. MathWorks.
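
Illustrative sketch of the gene-scoring procedure
The steps listed under Procedure (level 3 transform, reconstruction from the approximation only, scoring, ranking) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation, which the references suggest was done in MATLAB. It rests on several assumptions: only the Haar wavelet is shown, written in plain Python; each gene's profile is taken across all samples ordered by class; profiles are padded by repeating the last value up to a multiple of 8; and the function names (haar_approx, score_gene, rank_genes) and the toy data are hypothetical.

```python
import math

def haar_approx(signal, level=3):
    """Reconstruct `signal` from its Haar approximation at `level` only."""
    n = len(signal)
    # Assumption: pad with the last value so the length is a multiple of 2**level.
    pad = -n % (2 ** level)
    coeffs = list(signal) + [signal[-1]] * pad
    # Analysis: low-pass (pairwise sum / sqrt(2)) and downsample, `level` times.
    for _ in range(level):
        coeffs = [(coeffs[i] + coeffs[i + 1]) / math.sqrt(2)
                  for i in range(0, len(coeffs), 2)]
    # Synthesis with every detail coefficient set to zero, `level` times.
    for _ in range(level):
        coeffs = [c / math.sqrt(2) for c in coeffs for _dup in range(2)]
    return coeffs[:n]

def score_gene(profile, n_class1, level=3):
    """Score = abs(mean(class1) - mean(class2)) on the smoothed profile."""
    smooth = haar_approx(profile, level)
    class1, class2 = smooth[:n_class1], smooth[n_class1:]
    return abs(sum(class1) / len(class1) - sum(class2) / len(class2))

def rank_genes(expression, n_class1, level=3):
    """Return gene indices sorted from highest to lowest score."""
    scores = [score_gene(row, n_class1, level) for row in expression]
    return sorted(range(len(expression)), key=lambda g: scores[g], reverse=True)

# Toy data (invented): 3 genes x 16 samples, first 8 samples are class 1.
toy = [
    [1.0] * 16,              # no difference between classes
    [0.0] * 8 + [4.0] * 8,   # large difference
    [1.0] * 8 + [2.0] * 8,   # small difference
]
print(rank_genes(toy, n_class1=8))
```

The top-ranked indices would then be the genes passed on to the classifier. Other wavelets named on this poster, such as 'Bior1.5', need longer filters and a boundary-handling choice, which a wavelet library would normally supply.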