Using Ensemble Models in the Histological Examination of Tissue Abnormalities M. Coakley, G. Crocetti, P. Dressner, W. Kellum, T. Lamin The Michael L. Gargano 12th Annual Research Day Friday, May 2nd, 2014 Objective The objective of this study is: 1. to investigate the possibility of automatically identifying abnormalities in tissue samples through the use of an ensemble model on data generated by histological examination 2. to minimize the number of false negative cases. Introduction As part of breast cancer prevention screening if a lump is found a fineneedle aspiration biopsy (FNAB) is performed. Normally the sample is analyzed visually by a pathologist that looks for cancerous tissues with abnormal characteristics. This procedure is time consuming Automatic procedures do exists that evaluate cytology features derived from a digital scan of breast FNAB slides. These procedure achieve very high accuracies, and better than manual procedure, but still have a certain level of false negative Our goal is to reduce the false negative rate The Data Wisconsin Breast Cancer Dataset Containing 569 samples classified as “normal” or “abnormal” 12 attributes Dataset split: Training set: 448 samples. Test set: 121 samples. The Data Cont.… Table Structure (12 fields) Id Diagnosis (A=Abnormal/ N=Normal) Radius (mean of distances from center to points on the perimeter) Texture (standard deviation of gray-scale values) Perimeter Area Smoothness (local variation in radius lengths) Compactness (perimeter^2 / area - 1.0) Concavity (severity of concave portions of the contour) Concave points (number of concave portions of the contour) Symmetry Fractal dimension ("coastline approximation" - 1) Exploratory Data Analysis The data set was of very good quality No missing values Outliers detected through the use of Z-Score, with a possible outlier falling outside of the interval [-4,+4] 𝑥∗ 𝑥 − 𝜇𝑥 = 𝜎𝑥 We detected some outliers, but further investigation excluded errors in the data. Exploratory Data Analysis (Cont.…) Normality Assumption: variables normally distributed within acceptable variations Skewness within [-2,+2] Kurtosis within [-2,+2] Exploratory Data Analysis (Cont.…) Normalization: to avoid that variables will influence the model due to their scales we normalized the data using the min-Max transformation 𝑥 − 𝑚𝑖𝑛𝑥 𝑥 = 𝑚𝑎𝑥𝑥 − 𝑚𝑖𝑛𝑥 ∗ All resulting variables were within the interval of [0,1] Exploratory Data Analysis (Cont.…) Normalization: to avoid that variables will influence the model due to their scales we normalized the data using the min-Max transformation 𝑥∗ 𝑥 − 𝑚𝑖𝑛𝑥 = 𝑚𝑎𝑥𝑥 − 𝑚𝑖𝑛𝑥 All resulting variables were within the interval of [0,1] Correlation We kept radius and dropped the other variables. Clustering Derived a new “cluster” variable by applying the KMeans algorithm with k=2. Modeling Due to the characteristics of the data we applied two algorithms CART (with misclassification costs) Logistic Regression Confusion Matrixes & Error Rates Ensemble Model We leveraged the confidence interval measures produced by these models. Applied a voting scheme in which the prediction with the highest confidence wins. 𝑂𝑣𝑒𝑟𝑎𝑙𝑙 𝑒𝑟𝑟𝑜𝑟 𝑟𝑎𝑡𝑒 = 4+1 = 0.04 = 4% 121 𝐹𝑎𝑙𝑠𝑒 𝑛𝑒𝑔𝑎𝑡𝑖𝑣𝑒 𝑟𝑎𝑡𝑒 = 𝐹𝑎𝑙𝑠𝑒 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒 𝑟𝑎𝑡𝑒 = 1 = 0.01 = 1% 77 4 = 0.9 = 9% 44 Conclusions The voting-based ensemble model derived through the combination of decision trees and logistic regression proved to be a very efficient way of helping in improving the detection of abnormal biopsy samples. The very low false negative rate of 1% is a clear indication that this problem can be solved by the generation of high quality classification solutions, representing an improvement when compared to other classification systems developed in the past. References E. D. Pisano, L. L. Fajardo, D. J. Caudry, N. Sneige, W. J. Frable, W. A. Berg, I. Tocino, S. J. Schnitt, J. L. Connolly, C. A. Gatsonis, and B. J. McNeil, Fine-Needle Aspiration Biopsy of Nonpalpable Breast Lesions in a Multicenter Clinical Trial, Radiology, 2001, Vol. 219, Issue 3, pp. 785-792 W. H. Wolberg, W. N. Street, O. L. Mangasarian, Breast Cytology Diagnosis Via Digital Image Analysis, Dept. of Surgery, Universit of Wisconsin, 1993 W. Wolberg, W.N. Street, O.L. Mangasarian, Importance of nuclear morphology in breast cancer prognosis, Clinical Cancer Research, (1999) Vol. 5, 3542-3548 B. Lantz, “Machine Learning with R”, Packt Publishing, 2013 UCI-Machine Learning Repository, http://archive.ics.uci.edu/ml/ D. Larose, Discovering Knowledge in Data, Wiley, 2005. G. Seni and J. F. Elder, Ensemble Methods in Data Mining, Morgan & Claypool Publishers, 2009. J. F. Elder and S. S. Lee, Bundling Heterogeneous Classifiers with Advisor Perceptrons, University of Idaho, Technical Report, Oct. 1997.