Medical Diagnosis Decision-Support System: Optimizing Pattern Recognition of Medical Data
W. Art Chaovalitwongse
Industrial & Systems Engineering, Rutgers University
Center for Discrete Mathematics & Theoretical Computer Science (DIMACS)
Center for Advanced Infrastructure & Transportation (CAIT)
Center for Supply Chain Management, Rutgers Business School
This work is supported in part by research grants from NSF CAREER CCF-0546574 and the Rutgers Computing Coordination Council (CCC).

Outline
- Introduction
- Pattern-Based Classification Framework
- Application in Epilepsy
  - Classification: model-based versus pattern-based medical diagnosis
  - Seizure (event) prediction
  - Identifying epilepsy and non-epilepsy patients
- Application in Other Diagnosis Data
- Conclusion and Envisioned Outcome

Pattern Recognition: Classification
- Supervised learning: a class (category) label for each pattern in the training set is provided.
- [Figure: labeled training patterns from a positive class and a negative class, with an unlabeled test pattern ("?") to be assigned to one of them.]

Model-Based Classification
- Linear discriminant function: $g_i(x \mid w_i, w_{i0}) = w_i^T x + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}$
- Support vector machines: $\min_w L(w) = \frac{\|w\|^2}{2} + C \sum_{i=1}^{N} \xi_i^k$ subject to $f(x_i) = \begin{cases} +1 & \text{if } w \cdot x_i + b \ge 1 - \xi_i \\ -1 & \text{if } w \cdot x_i + b \le -1 + \xi_i \end{cases}$
- Neural networks: $g_k(x) = z_k = f\!\left(\sum_{j=1}^{n_H} w_{kj}\, f\!\left(\sum_{i=1}^{d} w_{ji} x_i + w_{j0}\right) + w_{k0}\right)$
- Example of a training table (samples as rows, attributes as columns, class or category label in the last column):

Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes

Support Vector Machine
- A and B are the data matrices of normal and pre-seizure samples, respectively.
- e is the vector of ones.
- w is a vector of real numbers (the normal of the separating plane); γ is a scalar (the plane's offset).
- u, v are the misclassification errors.
Mangasarian, Operations Research (1965); Bradley et al., INFORMS J. of Computing (1999)

Pattern-Based Classification: Nearest Neighbor Classifiers
- Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
- Compute the distance between the test record and the training records, then choose the k "nearest" records.

Traditional Nearest Neighbor
- [Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor neighborhoods around a test point X.]
- The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.

Drawbacks
- Feature selection: the method is sensitive to noisy features, and optimizing the feature selection over n features means searching 2^n combinations, a combinatorial optimization problem.
- Unbalanced data: the decision is biased toward the class (category) with more samples. Remedies include distance-weighted nearest neighbors, or picking the k nearest neighbors from each class (category) to the training sample and comparing the average distances.

Multidimensional Time Series Classification in Medical Data
- An unlabeled sample ("?") must be assigned to a class: normal versus abnormal, positive versus negative, responsive versus unresponsive.
- The data are multisensor medical signals (e.g., EEG, ECG, EMG).
- A fully multivariate model is ideal but computationally intractable.
- Physicians routinely use baseline data as a reference for diagnosis.
- The use of baseline data naturally lends itself to nearest neighbor classification (a class-balanced variant is sketched below).
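To make the class-balanced nearest-neighbor idea from the Drawbacks slide concrete (pick the k nearest baseline samples from each class and compare the average distances), here is a minimal Python sketch. The array shapes, the flattened Euclidean distance, and the function name are illustrative assumptions rather than the implementation used in this work.

```python
import numpy as np

def classify_balanced_knn(test_sample, baseline, labels, k=3):
    """Class-balanced k-NN: take the k nearest baseline samples from EACH
    class and assign the class with the smaller average distance.

    test_sample : (n_channels, n_points) array for one multichannel epoch
    baseline    : (n_samples, n_channels, n_points) array of baseline epochs
    labels      : (n_samples,) array of class labels (e.g., 0 = normal, 1 = abnormal)
    """
    # Euclidean distance between flattened multichannel epochs
    # (a DTW or T-statistic distance could be substituted here).
    dists = np.linalg.norm(
        baseline.reshape(len(baseline), -1) - test_sample.ravel(), axis=1)

    avg_dist = {}
    for c in np.unique(labels):
        class_dists = np.sort(dists[labels == c])
        avg_dist[c] = class_dists[:k].mean()   # k nearest within this class

    # Assign the class whose k nearest baseline samples are closest on average.
    return min(avg_dist, key=avg_dist.get)

# Toy usage with synthetic data (2 classes, 10 baseline epochs, 4 channels).
rng = np.random.default_rng(0)
baseline = rng.normal(size=(10, 4, 50))
labels = np.array([0] * 5 + [1] * 5)
print(classify_balanced_knn(rng.normal(size=(4, 50)), baseline, labels, k=3))
```

Because the k nearest neighbors are drawn from each class separately, a class with many more baseline samples no longer dominates the decision.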
Ensemble Classification for Multidimensional Time Series Data
- Use each electrode as a base classifier; each base classifier makes its own decision.
- With multiple decision makers, how do we combine them? By voting on the final decision or by averaging the prediction scores.
- Suppose there are 25 base classifiers, each with error rate $\varepsilon = 0.35$, and assume the classifiers are independent. The probability that the ensemble (majority-voting) classifier makes a wrong prediction is $\sum_{i=13}^{25} \binom{25}{i}\, \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06$.

Modified K-Nearest Neighbor for MDTS
- Each sample is a multidimensional time series (channels Ch 1, Ch 2, ..., Ch n). A distance D(X, Y) is computed per channel, and the K nearest baseline samples (e.g., K = 3) from the normal and abnormal groups are compared.
- Time series distances: (1) Euclidean, (2) T-statistic, (3) dynamic time warping.

Dynamic Time Warping (DTW)
- The minimum-distance warp path is the optimal alignment of two time series, where the distance of a warp path W is $Dist(W) = \sum_{k=1}^{K} Dist(w_{k,s}, w_{k,t})$.
- $Dist(W)$ is the cumulative (Euclidean) distance of warp path W, and $Dist(w_{k,s}, w_{k,t})$ is the distance between the two data point indices (from time series $L_i$ and $L_j$) in the kth element of the warp path.
- Dynamic programming recurrence: $D(s,t) = Dist(L_{s,i}, L_{t,j}) + \min\{D(s-1, t),\ D(s, t-1),\ D(s-1, t-1)\}$.
- For two series of length 30, the optimal warping distance is D(30, 30).
- Figure (B) is from Keogh and Pazzani, SDM (2001).

Optimizing Pattern Recognition
- Traditional pattern-based classification: baseline data → cleansed data → signal processing (feature extraction) → extracted features → feature selection → selected features of all baseline data → classifying new samples.
- Proposed pattern-based classification: baseline data → cleansed data → signal processing (feature extraction) → extracted features → selecting good baseline data and deleting outliers → integrated feature selection and pattern-matching optimization → optimally selected features of optimized baseline data → classifying new samples.

Support Feature Machine
- Given an unlabeled sample A, calculate the average statistical distances of A↔Normal and A↔Abnormal samples in the baseline (training) dataset per electrode (channel).
- Statistical distances: Euclidean, T-statistic, dynamic time warping.
- Combining all electrodes, A is classified to the group (normal or abnormal) that yields the minimum average statistical distance, or the maximum number of votes.
- Question: can we optimize the selection of a subset of electrodes that maximizes the number of correctly classified samples?

SFM: Averaging and Voting
- Two distances are calculated for each sample at each electrode:
  - Intra-class: the average distance from the sample to all other samples in the same class at electrode j.
  - Inter-class: the average distance from the sample to all other samples in the different class at electrode j.
- Averaging: if, for sample i, the average intra-class distance over the selected electrodes is less than the average inter-class distance over the selected electrodes, we claim that sample i is correctly classified.
- Voting: for sample i at electrode j, intra-class distance < inter-class distance counts as a good vote. Based on the selected electrodes, if the number of good votes exceeds the number of bad votes, then sample i is correctly classified.
Chaovalitwongse et al., KDD (2007); Chaovalitwongse et al., Operations Research (forthcoming)

Distance Averaging: Training
- For each sample i, the intra-class distance $d_{ij}$ and the inter-class distance $\bar{d}_{ij}$ are computed at every feature j = 1, ..., m.
- Select a subset of features $s \subseteq \{1, 2, \ldots, m\}$ such that $\sum_{j \in s} d_{ij} < \sum_{j \in s} \bar{d}_{ij}$ for as many samples i as possible.

Majority Voting: Training
- For sample i at feature j, define $a_{ij} = 1$ (correct) if $d_{ij} < \bar{d}_{ij}$, and $a_{ij} = 0$ (incorrect) otherwise.
- The goal is to select a subset of features so that, for as many samples as possible, the correct votes outnumber the incorrect ones (a computational sketch of these training matrices follows).
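The sketch below shows one way the training quantities just defined could be computed: a direct implementation of the DTW recurrence above, followed by the per-electrode intra-class distances $d_{ij}$, inter-class distances $\bar{d}_{ij}$, and the 0/1 voting matrix A. The data layout (samples × electrodes × time points) and the unconstrained warping window are assumptions for illustration, not the authors' code.

```python
import numpy as np

def dtw_distance(s, t):
    """Dynamic time warping via the recurrence
    D[a, b] = dist(s[a], t[b]) + min(D[a-1, b], D[a, b-1], D[a-1, b-1]).
    Quadratic-time, unconstrained warping window (fine for short epochs)."""
    D = np.full((len(s) + 1, len(t) + 1), np.inf)
    D[0, 0] = 0.0
    for a in range(1, len(s) + 1):
        for b in range(1, len(t) + 1):
            cost = abs(s[a - 1] - t[b - 1])
            D[a, b] = cost + min(D[a - 1, b], D[a, b - 1], D[a - 1, b - 1])
    return D[len(s), len(t)]

def sfm_training_matrices(X, y, dist=dtw_distance):
    """Build the per-electrode training matrices used by SFM.

    X : (n_samples, n_electrodes, n_points) time-series features (e.g., STLmax)
    y : (n_samples,) 0/1 class labels
    Returns (d_intra, d_inter, A): intra-/inter-class average-distance matrices
    and the 0/1 voting (precision) matrix, each of shape (n, m).
    """
    n, m, _ = X.shape
    d_intra = np.zeros((n, m))
    d_inter = np.zeros((n, m))
    for j in range(m):
        # Pairwise distances between all samples at electrode j.
        pair = np.zeros((n, n))
        for i in range(n):
            for k in range(i + 1, n):
                pair[i, k] = pair[k, i] = dist(X[i, j], X[k, j])
        for i in range(n):
            same = (y == y[i])
            same[i] = False                      # exclude the sample itself
            d_intra[i, j] = pair[i, same].mean()
            d_inter[i, j] = pair[i, y != y[i]].mean()
    A = (d_intra < d_inter).astype(int)          # a_ij = 1 means a "good vote"
    return d_intra, d_inter, A
```

The matrices returned here are exactly the inputs the SFM optimization models below consume; swapping `dist` for a Euclidean or T-statistic distance reproduces the other two variants.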
SFM Optimization Model
- n = total number of samples; m = total number of electrodes.
- Intra-class $d_{ij}$ = average distance from sample i to all other samples in the same class, for i = 1, ..., n and j = 1, ..., m.
- Inter-class $\bar{d}_{ij}$ = average distance from sample i to all other samples in the different class, for i = 1, ..., n and j = 1, ..., m.
- $y_i = 1$ if sample i is correctly classified, 0 otherwise, for i = 1, ..., n.
- $x_j = 1$ if electrode j is selected, 0 otherwise, for j = 1, ..., m.
Chaovalitwongse et al., KDD (2007); Chaovalitwongse et al., Operations Research (forthcoming)

Averaging SFM
$\max \sum_{i=1}^{n} y_i$ (maximize the number of correctly classified samples)
subject to
$\sum_{j=1}^{m} d_{ij} x_j - \sum_{j=1}^{m} \bar{d}_{ij} x_j \le M_1 (1 - y_i)$, for i = 1, ..., n
$\sum_{j=1}^{m} \bar{d}_{ij} x_j - \sum_{j=1}^{m} d_{ij} x_j \le M_2\, y_i$, for i = 1, ..., n
$\sum_{j=1}^{m} x_j \ge 1$
$x_j \in \{0,1\}$ for j = 1, ..., m; $y_i \in \{0,1\}$ for i = 1, ..., n
- The two big-M constraints ($M_1$, $M_2$ sufficiently large) are the logical constraints on the intra-class and inter-class distances: a sample counted as correctly classified must have smaller intra-class than inter-class distance over the selected electrodes. The third constraint requires that at least one electrode be selected.
Chaovalitwongse et al., KDD (2007); Chaovalitwongse et al., Operations Research (forthcoming)

Voting SFM
$\max \sum_{i=1}^{n} y_i$ (maximize the number of correctly classified samples)
subject to
$\frac{1}{2}\sum_{j=1}^{m} x_j - \sum_{j=1}^{m} a_{ij} x_j \le M_1 (1 - y_i)$, for i = 1, ..., n
$\sum_{j=1}^{m} a_{ij} x_j - \frac{1}{2}\sum_{j=1}^{m} x_j \le M_2\, y_i$, for i = 1, ..., n
$\sum_{j=1}^{m} x_j \ge 1$
$x_j \in \{0,1\}$ for j = 1, ..., m; $y_i \in \{0,1\}$ for i = 1, ..., n
- The logical (big-M) constraints require that a sample counted as correctly classified must win the voting over the selected electrodes; at least one electrode must be selected.
- The precision matrix A contains elements $a_{ij} = 1$ if sample i is correctly classified at electrode j (good vote) and $a_{ij} = 0$ otherwise (bad vote), for i = 1, ..., n and j = 1, ..., m.
Chaovalitwongse et al., KDD (2007); Chaovalitwongse et al., Operations Research (forthcoming)

Support Feature Machine: Training and Testing
- Training (normal and abnormal baseline samples):
  - Step 1: For each individual feature (electrode), apply the nearest neighbor rule to every training sample to construct the distance matrices and the accuracy (voting) matrix.
  - Step 2: Formulate and solve the SFM models to obtain the optimal feature (electrode) selection (x = electrode selection, y = training accuracy).
- Testing (unlabeled samples):
  - Step 3: Employ the nearest neighbor rule to classify unlabeled data to the closest baseline (training) data based on the selected features (electrodes).
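For readers who want to experiment with the electrode-selection step (Step 2), the following is a rough sketch of the voting SFM integer program using the PuLP modeling library. The big-M values, the default CBC solver, and the treatment of tied votes are assumptions of this sketch, not the published implementation.

```python
import pulp

def solve_voting_sfm(A, M1=None, M2=None):
    """Voting SFM sketch: select electrodes x_j so that as many training
    samples as possible win the per-electrode vote,
    i.e. sum_j a_ij x_j >= 0.5 * sum_j x_j (ties count as wins here).

    A : (n, m) 0/1 matrix, a_ij = 1 if sample i is correctly classified
        (good vote) at electrode j.
    """
    n, m = len(A), len(A[0])
    M1 = M1 if M1 is not None else m   # loose big-M bounds on the vote margin
    M2 = M2 if M2 is not None else m

    prob = pulp.LpProblem("voting_sfm", pulp.LpMaximize)
    x = [pulp.LpVariable(f"x_{j}", cat="Binary") for j in range(m)]
    y = [pulp.LpVariable(f"y_{i}", cat="Binary") for i in range(n)]

    # Maximize the number of correctly classified training samples.
    prob += pulp.lpSum(y)

    for i in range(n):
        votes = pulp.lpSum(A[i][j] * x[j] for j in range(m))
        half = 0.5 * pulp.lpSum(x)
        # If y_i = 1, sample i must win the vote; if y_i = 0, it must not.
        prob += half - votes <= M1 * (1 - y[i])
        prob += votes - half <= M2 * y[i]

    prob += pulp.lpSum(x) >= 1        # must select at least one electrode
    prob.solve(pulp.PULP_CBC_CMD(msg=False))

    selected = [j for j in range(m) if x[j].value() > 0.5]
    accuracy = sum(v.value() for v in y) / n
    return selected, accuracy
```

Calling `solve_voting_sfm(A)` on the training voting matrix returns the selected electrode indices and the training accuracy attained; the averaging SFM can be written the same way by replacing the vote constraints with the distance-sum constraints above.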
Support Vector Machine (feature-space view)
- [Figure: each EEG sample is represented as a data vector over channels Ch 1, ..., Ch n; in the feature space (Feature 1, Feature 2, Feature 3) the normal and pre-seizure vectors are separated by a hyperplane.]

Application in Epilepsy Diagnosis

Facts about Epilepsy
- About 3 million Americans and another 60 million people worldwide (about 1% of the population) suffer from epilepsy.
- Epilepsy is the second most common brain disorder (after stroke); it causes recurrent seizures (not vice versa).
- Seizures usually occur spontaneously, in the absence of external triggers.
- Epileptic seizures occur when a massive group of neurons in the cerebral cortex suddenly begins to discharge in a highly organized rhythmic pattern.
- Seizures cause temporary disturbances of brain functions such as motor control, responsiveness, and recall, which typically last from seconds to a few minutes.
- Based on 1995 estimates, epilepsy imposes an annual economic burden of $12.5 billion* in the U.S. in associated health care costs and losses in employment, wages, and productivity.
- Cost per patient ranged from $4,272 for persons** in remission after initial diagnosis and treatment to $138,602 for persons** with intractable and frequent seizures.
*Begley et al., Epilepsia (2000); **Begley et al., Epilepsia (1994)

Simplified EEG System and Intracranial Electrode Montage
- [Figure: intracranial electrode montage with electrode strips ROF, RST, RTD, LOF, LST, and LTD, each with numbered contacts.]
- Electroencephalography (EEG) is a traditional tool for evaluating the physiological state of the brain by measuring the voltage potentials produced by brain cells as they communicate.

Scalp EEG Acquisition
- [Figure: standard scalp electrode placement (Fp1, Fp2, F7, F3, Fz, F4, F8, T3, C3, C4, T4, T5, P3, Pz, P4, T6, O1, O2, Oz).]
- 18 bipolar channels.

Goals: How Can We Help?
- Seizure prediction: recognizing (data-mining) abnormality patterns in EEG signals preceding seizures.
  - Normal versus pre-seizure; alert when pre-seizure samples are detected (online classification).
  - Analogous settings: statistical process control in production systems, attack alerts from sensor data, stock market analysis.
- EEG classification (routine EEG check): quickly identify whether patients have epilepsy.
  - Epilepsy versus non-epilepsy.
  - There are many causes of seizures: convulsive or other seizure-like activity can be non-epileptic in origin and is observed in many other medical conditions. These non-epileptic seizures can be hard to differentiate and may lead to misdiagnosis.
  - Analogous setting: a medical check-up with normal and abnormal samples.

Normal versus Pre-Seizure

10-Second EEGs: Seizure Evolution
- [Figure: 10-second EEG segments from the normal, pre-seizure, seizure-onset, and post-seizure states.]
Chaovalitwongse et al., Annals of Operations Research (2006)

Normal versus Pre-Seizure Data Set
EEG dataset characteristics (CP: complex partial; SC: subclinical; GTC: generalized tonic/clonic):

Patient ID | Seizure types | Duration of EEG (days) | # of seizures
1 | CP, SC | 3.55 | 7
2 | CP, GTC, SC | 10.93 | 7
3 | CP | 8.85 | 22
4 | SC | 5.93 | 19
5 | CP, SC | 13.13 | 17
6 | CP, SC | 11.95 | 17
7 | CP, SC | 3.11 | 9
8 | CP, SC | 6.09 | 23
9 | CP, SC | 11.53 | 20
10 | CP | 9.65 | 12
Total |  | 84.71 | 153

Sampling Procedure
- Randomly and uniformly sample 3 EEG epochs per seizure from each of the normal and pre-seizure states. For example, Patient 1 has 7 seizures, so 21 normal and 21 pre-seizure EEG epochs are sampled.
- [Figure: timeline over the EEG recording showing normal epochs sampled at least 8 hours away from any seizure and pre-seizure epochs sampled within the 30 minutes preceding a seizure.]
- Use leave-one(seizure)-out cross validation to perform training and testing.

Measure the Brain Dynamics from EEG Signals
- Apply dynamical measures (based on chaos theory) to non-overlapping EEG epochs of 10.24 seconds (2048 points).
- Maximum short-term Lyapunov exponent (STLmax): measures the stability/chaoticity of EEG signals, i.e., the average uncertainty along the local eigenvectors and the phase differences of an attractor in the phase space.
- [Figure: feature extraction from the raw EEG voltage traces over time.]
Pardalos, Chaovalitwongse, et al., Mathematical Programming (2004)

Evaluation
- Sensitivity measures the fraction of positive cases that are classified as positive; specificity measures the fraction of negative cases that are classified as negative.
- Sensitivity = TP/(TP+FN); Specificity = TN/(TN+FP).
- Type I error = 1 - Specificity; Type II error = 1 - Sensitivity.
Chaovalitwongse et al., Epilepsy Research (2005)

Leave-One-Seizure-Out Cross Validation
- [Figure: assuming there are 5 seizures in the recordings, the normal (N1-N5) and pre-seizure (P1-P5) epoch groups from 4 seizures form the training set used by SFM to select electrodes, and the epochs from the held-out seizure form the testing set. N = EEGs from the normal state; P = EEGs from the pre-seizure state.]
- A code sketch of this protocol and the evaluation measures follows.
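Below is a small sketch of the leave-one-seizure-out protocol together with the sensitivity/specificity measures defined above. The classifier is left as a placeholder callable (any of the SVM/KNN/SFM variants could be plugged in), and the data layout and function names are assumptions of this sketch.

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP/(TP+FN), Specificity = TN/(TN+FP); positives = pre-seizure.
    Type I error = 1 - specificity, Type II error = 1 - sensitivity."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)

def leave_one_seizure_out(X, y, seizure_id, fit_predict):
    """Hold out all epochs sampled around one seizure at a time.

    X           : (n_samples, ...) feature array (e.g., STLmax epochs)
    y           : (n_samples,) labels, 1 = pre-seizure, 0 = normal
    seizure_id  : (n_samples,) id of the seizure each epoch was sampled around
    fit_predict : callable(X_train, y_train, X_test) -> predicted labels
    """
    sens, spec = [], []
    for s in np.unique(seizure_id):
        test = seizure_id == s
        y_pred = fit_predict(X[~test], y[~test], X[test])
        se, sp = sensitivity_specificity(y[test], y_pred)
        sens.append(se)
        spec.append(sp)
    return np.mean(sens), np.mean(spec)
```

The leave-one-patient-out protocol used later for the epilepsy versus non-epilepsy study has the same structure, with the patient id taking the place of the seizure id.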
EEG Classification
- Support Vector Machine [Chaovalitwongse et al., Annals of OR (2006)]: project the time series data into a high-dimensional (feature) space and generate a hyperplane that separates the two groups of data while minimizing the errors.
- Ensemble K-Nearest Neighbor [Chaovalitwongse et al., IEEE SMC: Part A (2007)]: use each electrode as a base classifier, apply the NN rule using statistical time series distances, optimize the value of "k" in training, and combine decisions by voting and averaging.
- Support Feature Machine [Chaovalitwongse et al., SIGKDD (2007); Chaovalitwongse et al., Operations Research (forthcoming)]: use each electrode as a base classifier, apply the NN rule to the entire baseline data, and optimize by selecting the best group of classifiers (electrodes/features). Voting optimizes the ensemble classification; averaging uses the concept of inter-class and intra-class distances (or prediction scores).

Performance Characteristics: Upper Bound
- [Figure: training (upper-bound) performance of the three approaches.]
NN -> Chaovalitwongse et al., Annals of Operations Research (2006); SFM -> Chaovalitwongse et al., SIGKDD (2007) and Chaovalitwongse et al., Operations Research (forthcoming); KNN -> Chaovalitwongse et al., IEEE Trans. Systems, Man, and Cybernetics: Part A (2007)

Separation of Normal and Pre-Seizure EEGs
- [Figure: feature-space separation of normal and pre-seizure EEGs using 3 electrodes selected by SFM versus 3 electrodes not selected by SFM.]

Performance Characteristics: Validation
- [Figure: out-of-sample (validation) performance of the three approaches.]
SVM -> Chaovalitwongse et al., Annals of Operations Research (2006); SFM -> Chaovalitwongse et al., SIGKDD (2007) and Chaovalitwongse et al., Operations Research (forthcoming); KNN -> Chaovalitwongse et al., IEEE Trans. Systems, Man, and Cybernetics: Part A (2007)

Epilepsy versus Non-Epilepsy

Epilepsy versus Non-Epilepsy Data Set
- Routine EEG check: 25-30 minutes of recordings with 18 scalp electrodes (Elec 1 through Elec 18), from non-epilepsy and epilepsy patients.
- Each recording yields about 150 STLmax points (25 minutes), from which 5 epochs of 30 points each are sampled.
- Each sample is a 5-minute EEG epoch (30 points of STLmax values), i.e., each sample has the form 18 electrodes × 30 points.

Leave-One-Patient-Out Cross Validation
- [Figure: the non-epilepsy (N1-N5) and epilepsy (E1-E5) samples from all but one patient form the training set used by SFM to select electrodes, and the held-out patient's samples form the testing set. N = non-epilepsy; E = epilepsy.]

Voting SFM: Validation
- [Figure: overall accuracy (average over 10 patients) of the voting SFM compared with KNN (k = 5, 7, 9, 11, and all) for the DTW, Euclidean (EU), and T-statistic (TS) distances.]

Averaging SFM: Validation
- [Figure: overall accuracy (average over 10 patients) of the averaging SFM compared with KNN (k = 5, 7, 9, 11, and all) for the DTW, Euclidean (EU), and T-statistic (TS) distances.]

Selected Electrodes from Averaging SFM
- [Figure: selection percentage of each of the 18 electrodes under the averaging SFM with the DTW, EU, and TS distances; the annotated channels 1 (Fp1-C3), 16 (T6-Oz), and 17 (Fz-Oz) are highlighted as among the most frequently selected.]

Other Medical Diagnosis

Other Medical Datasets

Dataset | Attributes | Classification task
Breast Cancer | Features of cell nuclei (radius, perimeter, smoothness, etc.) | Malignant or benign tumors
Diabetes | Patient records (age, body mass index, blood pressure, etc.) | Diabetic or not
Heart Disease | General patient info, symptoms (e.g., chest pain), blood tests | Identify presence of heart disease
Liver Disorders | Features of blood tests | Detect the presence of liver disorders from excessive alcohol consumption
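To connect the EEG pipeline to tabular datasets like these, here is a rough sketch of the per-feature average-distance voting rule applied to the Wisconsin breast-cancer data from scikit-learn. It omits the optimization-based feature selection and uses a simple train/test split, so it only illustrates the decision rule; its accuracy will not match the tables that follow.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

def vote_classify(x_new, X_train, y_train):
    """For each feature, vote for the class whose training samples are closer
    on average to the new sample; the majority vote over features wins."""
    votes = []
    for j in range(X_train.shape[1]):
        col, v = X_train[:, j], x_new[j]
        d0 = np.abs(col[y_train == 0] - v).mean()   # avg distance to class 0
        d1 = np.abs(col[y_train == 1] - v).mean()   # avg distance to class 1
        votes.append(0 if d0 < d1 else 1)
    return int(np.mean(votes) >= 0.5)

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
pred = np.array([vote_classify(x, X_tr, y_tr) for x in X_te])
print("accuracy:", (pred == y_te).mean())
```

Restricting the loop to a subset of feature indices chosen by the voting or averaging SFM model reproduces the full pipeline on tabular data.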
Performance
Classification accuracy (%) on the four datasets above (WDBC = breast cancer, HD = heart disease, PID = diabetes, BLD = liver disorders).

Training accuracy (%):
Dataset | LP SVM | NLP SVM | V-SFM | A-SFM
WDBC | 98.08 | 96.17 | 97.28 | 97.42
HD | 85.06 | 84.66 | 86.48 | 86.92
PID | 77.66 | 77.51 | 75.01 | 77.96
BLD | 65.71 | 57.97 | 63.46 | 66.43

Testing accuracy (%):
Dataset | LP SVM | NLP SVM | V-NN | A-NN | V-SFM | A-SFM
WDBC | 97.00 | 95.38 | 91.60 | 93.18 | 94.99 | 96.01
HD | 82.96 | 83.94 | 80.87 | 82.77 | 82.49 | 84.92
PID | 76.93 | 76.09 | 63.14 | 74.94 | 72.75 | 75.83
BLD | 65.71 | 57.97 | 38.38 | 54.09 | 58.20 | 59.57

Average Number of Selected Features
Dataset | LP SVM | NLP SVM | V-SFM | A-SFM
WDBC | 30 | 30 | 11.6 | 8.5
HD | 13 | 13 | 7.4 | 8.7
PID | 8 | 8 | 4.3 | 4.5
BLD | 6 | 6 | 3.3 | 3.7

Medical Data Signal Processing Apparatus (MeDSPA)
- Quantitative analyses of medical data, including neurophysiological data (e.g., EEG, fMRI) acquired during brain diagnosis.
- Envisioned as an automated decision-support system configured to accept input medical signal data (associated with a spatial position or feature) and provide measurement data that help physicians reach a more confident diagnosis outcome.
- Aims to improve current medical diagnosis and prognosis by assisting physicians in:
  - recognizing (data-mining) abnormality patterns in medical data;
  - recommending the diagnosis outcome (e.g., normal or abnormal);
  - identifying a graphical indication (or feature) of abnormality (localization).

Automated Abnormality Detection Paradigm
- [Figure: data acquisition of multichannel brain activity → optimization (feature extraction/clustering) → statistical analysis (pattern recognition) → interface technology, which can alert a nurse or the user/patient and initiate a warning or a variety of therapies (e.g., electrical stimulation via a stimulator, drug injection).]

Acknowledgement
- Collaborators: E. Micheli-Tzanakou, PhD; L.D. Iasemidis, PhD; R.C. Sachdeo, MD; R.M. Lehman, MD; B.Y. Wu, MD, PhD.
- Students: Y.J. Fan, MS; other undergraduate students.

Thank you for your attention! Questions?