International Biometric Society

LASSO CLUSTERING METHOD FOR CLASSIFICATION OF CANCER SUBTYPES USING MICROARRAY DATA

Masaru Ushijima 1, Shinto Eguchi 2, Osamu Komori 2, Yoshio Miki 1, Masaaki Matsuura 1
1: Genome Center, Japanese Foundation for Cancer Research, Tokyo, Japan
2: Institute of Statistical Mathematics, Tokyo, Japan

Introduction: The importance of molecular portraits of cancer, and the possibility of classifying a cancer into biologically and molecularly distinct groups, have been recognized in cancer research and clinical management. To find useful, high-performance biomarkers for identifying the subtype of each cancer patient, effective selection of a gene set from molecular profile data is crucial. In this study, we developed a method for selecting genes for the classification of breast cancer subtypes using a novel lasso-type clustering method.

Material and Methods: The method is based on a stepwise algorithm that shrinks the gene set over a series of ordinary k-means clustering analyses. We introduce an L1 penalty on the cluster centers obtained by ordinary k-means clustering. In this algorithm, genes satisfying a condition derived from lasso theory with a prespecified parameter are excluded from the candidate set of genes. At each step, we evaluate the number of subjects that fall in the same clusters in the previous and the current k-means results (the overlap). When the overlap rate becomes large, we change the parameter so that the overlap rate stays high at each step. We applied the method to microarray data from 417 breast cancer patients treated at the Cancer Hospital of JFCR to examine its performance. Additionally, we examined whether a subject may have more than one subtype, based on iterative random selections of subjects for k-means clustering analysis.
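
As a rough illustration of the stepwise procedure described in Material and Methods, the following Python sketch alternates ordinary k-means with a gene exclusion step. Several details are assumptions that the abstract does not specify: the exclusion rule (a gene is dropped when every cluster center has absolute value at most `lam`, the condition under which L1 soft-thresholding at `lam` shrinks a center to zero), the overlap threshold `min_overlap`, the rule for adjusting `lam` (multiply by 1.5 after a stable step, halve otherwise), and the farthest-first initialization used to keep k-means deterministic. All function and parameter names are hypothetical.

```python
import itertools
import numpy as np


def kmeans(X, k, n_iter=50):
    """Ordinary Lloyd k-means with farthest-first initialization (assumed)."""
    idx = [0]
    for _ in range(k - 1):
        # Pick the subject farthest from its nearest already-chosen center.
        d = ((X[:, None, :] - X[idx][None, :, :]) ** 2).sum(axis=2).min(axis=1)
        idx.append(int(d.argmax()))
    centers = X[idx].copy()
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):  # keep the old center if a cluster empties
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers


def overlap_rate(prev, curr, k):
    """Fraction of subjects kept in the same cluster, maximized over relabelings."""
    return max(
        np.mean(np.asarray(perm)[curr] == prev)
        for perm in itertools.permutations(range(k))
    )


def lasso_kmeans_select(X, k, lam=0.1, min_overlap=0.9, max_steps=20):
    """Stepwise gene elimination; X is subjects x genes with each gene centered."""
    genes = np.arange(X.shape[1])
    labels, centers = kmeans(X, k)
    for _ in range(max_steps):
        # Assumed exclusion rule: drop a gene if all its centers are within lam of 0.
        keep = np.abs(centers).max(axis=0) > lam
        if keep.all():
            lam *= 1.5  # nothing to exclude yet: strengthen the penalty
            continue
        if not keep.any():
            lam *= 0.5  # everything would be excluded: relax the penalty
            continue
        new_labels, new_centers = kmeans(X[:, genes[keep]], k)
        if overlap_rate(labels, new_labels, k) >= min_overlap:
            # Clusters are stable: accept the smaller gene set.
            genes, labels, centers = genes[keep], new_labels, new_centers
            lam *= 1.5
        else:
            lam *= 0.5  # clusters changed too much: reject the step and retry
    return genes, labels
```

Calling `lasso_kmeans_select(X, k=4)` on a centered subjects-by-genes matrix returns the indices of the retained genes and the final cluster labels. The permutation search in `overlap_rate` is only practical for a small number of clusters, such as the four subtypes considered here.
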
Results: Before the main analyses, we examined the number of subtypes among our 417 patients using consensus clustering (Monti et al., Machine Learning, 2003) and found that four subtypes (luminal A, luminal B, HER2-enriched, and basal-like) were stable for our data. To assess the prediction performance of our gene selection method, we randomly divided the 417 subjects into 300 training subjects for gene selection and 117 test subjects for prediction, and repeated this split 1000 times. The median number of genes selected from the 300 training subjects was 284, and the median concordance rate between the predictions and the result of ordinary k-means clustering was 88.9% for the 117 test subjects. In this analysis, we found that subjects with more than one subtype affect the prediction performance. When we excluded the 53 subjects with more than one subtype, the median accuracy rose to 92.2%. Such subjects may therefore affect both training and prediction performance.

Conclusion: We developed a novel gene selection method for multi-subtype classification based on a lasso clustering method. We confirmed that the method performs well when evaluated as a prediction problem. Furthermore, we found that subjects with more than one subtype affect the prediction performance.

International Biometric Conference, Florence, Italy, 6–11 July 2014