Medical Diagnosis Decision-Support System: Optimizing Pattern Recognition of Medical Data

W. Art Chaovalitwongse
Industrial & Systems Engineering, Rutgers University
Center for Discrete Mathematics & Theoretical Computer Science (DIMACS)
Center for Advanced Infrastructure & Transportation (CAIT)
Center for Supply Chain Management, Rutgers Business School

This work is supported in part by research grants from NSF CAREER CCF-0546574 and the Rutgers Computing Coordination Council (CCC).
Outline

- Introduction
  - Classification: Model-Based versus Pattern-Based
  - Medical Diagnosis
- Pattern-Based Classification Framework
- Application in Epilepsy
  - Seizure (Event) Prediction
  - Identify epilepsy and non-epilepsy patients
- Application in Other Diagnosis Data
- Conclusion and Envisioned Outcome
Pattern Recognition: Classification

Supervised learning: a class (category) label for each pattern in the training set is provided.

[Figure: an unlabeled pattern (?) to be assigned to the Positive Class or the Negative Class]
Model-Based Classification

- Linear Discriminant Function

  $g_i(\mathbf{x} \mid \mathbf{w}_i, w_{i0}) = \mathbf{w}_i^T \mathbf{x} + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}$

- Support Vector Machines

  $\min_{\mathbf{w}} \; L(\mathbf{w}) = \frac{\|\mathbf{w}\|^2}{2} + C \left( \sum_{i=1}^{N} \xi_i \right)^k$

  subject to

  $f(\mathbf{x}_i) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 1 - \xi_i \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 + \xi_i \end{cases}$

- Example training data (samples are rows, attributes are columns, with a class/category label):

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

- Neural Networks

  $g_k(\mathbf{x}) = z_k = f\left( \sum_{j=1}^{n_H} w_{kj} \, f\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right)$
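The linear discriminant reduces to a dot product per class: evaluate each class's score and pick the largest. A minimal sketch (the weights here are illustrative, not fitted to any data):

```python
import numpy as np

def linear_discriminant(x, W, w0):
    """Evaluate g_i(x) = w_i . x + w_i0 for every class i and
    return the index of the class with the largest score."""
    scores = W @ x + w0          # one score per class
    return int(np.argmax(scores)), scores

# Toy 2-class, 2-feature example (weights chosen by hand for illustration).
W = np.array([[1.0, -1.0],       # class 0 weights w_0
              [-1.0, 1.0]])      # class 1 weights w_1
w0 = np.array([0.0, 0.0])
label, scores = linear_discriminant(np.array([2.0, 0.5]), W, w0)
print(label)  # class 0 wins since 2 - 0.5 > -2 + 0.5
```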
Support Vector Machine

- A and B are data matrices of normal and pre-seizure samples, respectively
- e is the vector of ones
- w is a vector of real numbers (the normal of the separating plane)
- γ is a scalar (the plane's offset)
- u, v are the misclassification errors

Mangasarian, Operations Research (1965); Bradley et al., INFORMS Journal on Computing (1999)
Pattern-Based Classification: Nearest Neighbor Classifiers

- Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.

[Figure: given a test record, compute the distance to the training records, then choose the k "nearest" records]
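The nearest-neighbor rule can be written in a few lines; a minimal sketch with Euclidean distance on a toy two-cluster dataset (all data values are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(test_record, train_records, train_labels, k=3):
    """Classify a test record by majority vote among its k nearest
    training records under Euclidean distance."""
    dists = np.linalg.norm(train_records - test_record, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k smallest distances
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy two-cluster dataset.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y = np.array(["neg", "neg", "neg", "pos", "pos", "pos"])
pred = knn_classify(np.array([0.95, 1.0]), X, y, k=3)
print(pred)  # -> pos
```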
Traditional Nearest Neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a test record X]

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
Drawbacks

- Feature Selection
  - Sensitive to noisy features
  - Optimizing feature selection: n features give 2^n combinations → a combinatorial optimization problem
- Unbalanced Data
  - Biased toward the class (category) with more samples
  - Distance-weighted nearest neighbors: pick the k nearest neighbors from each class (category) to the training sample and compare the average distances.
Multidimensional Time Series Classification in Medical Data

[Figure: an unlabeled multichannel recording (?) to be classified as Normal or Abnormal]

- Positive versus Negative; Responsive versus Unresponsive
- Multidimensional time series classification of multisensor medical signals (e.g., EEG, ECG, EMG)
- Fully multivariate modeling is ideal but computationally intractable.
- Physicians very commonly use baseline data as a reference for diagnosis.
- The use of baseline data naturally lends itself to nearest neighbor classification.
Ensemble Classification for Multidimensional Time Series Data

- Use each electrode as a base classifier.
- Each base classifier makes its own decision.
- Multiple decision makers: how do we combine them?
  - Voting on the final decision
  - Averaging the prediction scores
- Suppose there are 25 base classifiers:
  - Each classifier has error rate ε = 0.35.
  - Assume the classifiers are independent.
  - The probability that the ensemble classifier makes a wrong prediction (voting) is

    $\sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1 - \varepsilon)^{25-i} \approx 0.06$
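The 0.06 figure follows from the binomial distribution: the majority vote errs only when at least 13 of the 25 independent base classifiers err. A quick check:

```python
from math import comb

def ensemble_error(n=25, eps=0.35):
    """Probability that a majority vote of n independent base classifiers
    (each with error rate eps) is wrong: more than half must err."""
    return sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
               for i in range((n // 2) + 1, n + 1))

p = ensemble_error()
print(round(p, 3))  # ~0.06, well below the individual error rate 0.35
```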
Modified K-Nearest Neighbor for MDTS

[Figure: a multichannel test sample (Ch 1, Ch 2, Ch 3, …, Ch n) is compared, channel by channel, with the K = 3 nearest Normal and Abnormal baseline samples via a distance D(X, Y)]

Time series distances: (1) Euclidean, (2) T-statistic, (3) Dynamic Time Warping
Dynamic Time Warping (DTW)

The minimum-distance warp path is the optimal alignment of two time series, where the distance of a warp path W is:

$Dist(W) = \sum_{k=1}^{K} Dist(w_{k,s}, w_{k,t})$

Dist(W) is the Euclidean distance of warp path W; Dist(w_{k,s}, w_{k,t}) is the distance between the two data point indices (from L_i and L_j) in the kth element of the warp path.

Dynamic programming: $D(s,t) = Dist(L_{s_i}, L_{t_j}) + \min\{D(s-1,t),\; D(s,t-1),\; D(s-1,t-1)\}$

The optimal warping distance is D(30, 30).

Figure (B) is from Keogh and Pazzani, SDM (2001).
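The dynamic program translates directly to code; a minimal sketch using absolute difference as the pointwise distance (the EEG application would plug in its own distance measure):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-programming DTW: D(s,t) = dist(a[s], b[t]) +
    min(D(s-1,t), D(s,t-1), D(s-1,t-1)); returns D(len(a), len(b))."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for s in range(1, n + 1):
        for t in range(1, m + 1):
            cost = abs(a[s - 1] - b[t - 1])      # pointwise distance
            D[s, t] = cost + min(D[s - 1, t], D[s, t - 1], D[s - 1, t - 1])
    return D[n, m]

# A phase-shifted copy of a sine wave: warping can re-align the points.
x = np.sin(np.linspace(0, 2 * np.pi, 30))
y = np.sin(np.linspace(0, 2 * np.pi, 30) + 0.3)
d_warp = dtw_distance(x, y)
# The diagonal (no-warp) path is one valid warp path, so DTW never exceeds it.
print(d_warp <= np.sum(np.abs(x - y)))  # -> True
```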
Optimizing Pattern Recognition

Traditional pattern-based classification:
Baseline Data → Cleansed Data → Signal Processing (Feature Extraction) → Extracted Features → Feature Selection → Selected Features of All Baseline Data → Classifying New Samples

Proposed pattern-based classification:
Baseline Data → Cleansed Data → Signal Processing (Feature Extraction) → Extracted Features → Selecting Good Baseline Data and Deleting Outliers → Integrated Feature Selection & Pattern Matching Optimization → Optimally Selected Features of Optimized Baseline Data → Classifying New Samples
Support Feature Machine

- Given an unlabeled sample A, we calculate the average statistical distances of A↔Normal and A↔Abnormal samples in the baseline (training) dataset per electrode (channel).
  - Statistical distances: Euclidean, T-statistic, Dynamic Time Warping
- Combining all electrodes, A will be classified to the group (normal or abnormal) that yields
  - the minimum average statistical distance; or
  - the maximum number of votes
- Can we select/optimize the selection of a subset of electrodes that maximizes the number of correctly classified samples?
SFM: Averaging and Voting

Two distances for each sample at each electrode are calculated:
- Intra-class: average distance from the sample to all other samples in the same class at Electrode j
- Inter-class: average distance from the sample to all other samples in the different class at Electrode j

Averaging: if, for Sample i, the average intra-class distance over the selected electrodes is less than the average inter-class distance over the selected electrodes, we claim that Sample i is correctly classified.

Voting: if, for Sample i at Electrode j, the intra-class distance is less than the inter-class distance, Electrode j casts a good vote. Based on the selected electrodes, if the number of good votes exceeds the number of bad votes, then Sample i is correctly classified.

Chaovalitwongse et al., KDD (2007); Chaovalitwongse et al., Operations Research (forthcoming)
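A small sketch of the intra-/inter-class distance computation and the resulting votes, using absolute difference as the per-electrode distance (the talk uses Euclidean, T-statistic, or DTW distances over time series; the data values here are toy examples):

```python
import numpy as np

def intra_inter_distances(samples, labels, j):
    """For electrode (feature) j, return for every sample i the intra-class
    distance d_ij (average distance to all other samples of the same class)
    and the inter-class distance dbar_ij (average distance to all samples
    of the other class)."""
    n = len(samples)
    d, dbar = np.zeros(n), np.zeros(n)
    for i in range(n):
        same = [k for k in range(n) if k != i and labels[k] == labels[i]]
        diff = [k for k in range(n) if labels[k] != labels[i]]
        d[i] = np.mean([abs(samples[i][j] - samples[k][j]) for k in same])
        dbar[i] = np.mean([abs(samples[i][j] - samples[k][j]) for k in diff])
    return d, dbar

# Voting matrix column: a_ij = 1 (good vote) iff d_ij < dbar_ij.
samples = np.array([[0.1, 5.0], [0.2, 5.2], [0.9, 1.0], [1.0, 1.1]])
labels = np.array([0, 0, 1, 1])
d, dbar = intra_inter_distances(samples, labels, j=0)
a_col0 = (d < dbar).astype(int)
print(a_col0)  # feature 0 separates the classes, so every vote is good
```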
Distance Averaging: Training

[Figure: for Sample i, intra-class distances $d_{i1}, d_{i2}, \ldots, d_{im}$ and inter-class distances $\bar{d}_{i1}, \bar{d}_{i2}, \ldots, \bar{d}_{im}$ at Features 1, 2, …, m]

Select a subset of features $s \subseteq \{1, 2, \ldots, m\}$ such that $\sum_{j \in s} d_{ij} < \sum_{j \in s} \bar{d}_{ij}$ for as many samples i as possible.
Majority Voting: Training

[Figure: at Feature j, Sample i lies closer to its own class (intra-class distance $d_{ij}$) than to the other class (inter-class distance $\bar{d}_{ij}$), while Sample i′ does not]

$a_{ij} = 1$ (correct) if $d_{ij} < \bar{d}_{ij}$; $a_{ij} = 0$ (incorrect) otherwise.
SFM Optimization Model

n = total number of samples; m = total number of electrodes.

Intra-class: $d_{ij}$ = average distance from sample i to all other samples in the same class, for i = 1, …, n and j = 1, …, m.

Inter-class: $\bar{d}_{ij}$ = average distance from sample i to all other samples in the different class, for i = 1, …, n and j = 1, …, m (the overbar marks inter-class distances).

$y_i$ = 1 if sample i is correctly classified, 0 otherwise, for i = 1, …, n.

$x_j$ = 1 if electrode j is selected, 0 otherwise, for j = 1, …, m.

Chaovalitwongse et al., KDD (2007); Chaovalitwongse et al., Operations Research (forthcoming)
Averaging SFM

$\max \sum_{i=1}^{n} y_i$   (maximize the number of correctly classified samples)

subject to

$\sum_{j=1}^{m} \bar{d}_{ij} x_j - \sum_{j=1}^{m} d_{ij} x_j \le M_1 y_i$   for i = 1, …, n

$\sum_{j=1}^{m} d_{ij} x_j - \sum_{j=1}^{m} \bar{d}_{ij} x_j \le M_2 (1 - y_i)$   for i = 1, …, n

$\sum_{j=1}^{m} x_j \ge 1$   (must select at least one electrode)

$x_j \in \{0,1\}$ for j = 1, …, m;   $y_i \in \{0,1\}$ for i = 1, …, n

The first two are logical (big-M) constraints on the intra-class and inter-class distances: a sample can count as correctly classified ($y_i = 1$) only when its selected intra-class distance sum falls below the inter-class sum.

Chaovalitwongse et al., KDD (2007); Chaovalitwongse et al., Operations Research (forthcoming)
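In the talk these models are solved as integer programs; for intuition only, a brute-force sketch of the averaging SFM objective that enumerates all 2^m − 1 electrode subsets (workable only for small m; the distance matrices below are toy values):

```python
from itertools import combinations
import numpy as np

def averaging_sfm_bruteforce(d, dbar):
    """Exhaustively search electrode subsets for the one maximizing the
    number of samples whose selected intra-class distance sum is below
    the inter-class sum, i.e., the averaging SFM objective."""
    n, m = d.shape
    best_count, best_subset = -1, None
    for r in range(1, m + 1):                  # at least one electrode
        for subset in combinations(range(m), r):
            s = list(subset)
            correct = np.sum(d[:, s].sum(axis=1) < dbar[:, s].sum(axis=1))
            if correct > best_count:
                best_count, best_subset = int(correct), s
    return best_subset, best_count

# Electrode 0 is informative (intra << inter); electrode 1 is noise.
d    = np.array([[0.1, 0.9], [0.2, 0.8], [0.1, 0.9]])
dbar = np.array([[0.9, 0.1], [0.8, 0.2], [0.9, 0.1]])
subset, correct = averaging_sfm_bruteforce(d, dbar)
print(subset, correct)  # -> [0] 3
```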
Voting SFM

$\max \sum_{i=1}^{n} y_i$   (maximize the number of correctly classified samples)

subject to

$\sum_{j=1}^{m} a_{ij} x_j - \frac{1}{2} \sum_{j=1}^{m} x_j \le M_1 y_i$   for i = 1, …, n

$\frac{1}{2} \sum_{j=1}^{m} x_j - \sum_{j=1}^{m} a_{ij} x_j + \varepsilon \le M_2 (1 - y_i)$   for i = 1, …, n

$\sum_{j=1}^{m} x_j \ge 1$   (must select at least one electrode)

$x_j \in \{0,1\}$ for j = 1, …, m;   $y_i \in \{0,1\}$ for i = 1, …, n;   $\varepsilon > 0$

Logical constraints: a sample can count as correctly classified ($y_i = 1$) only if its good votes strictly exceed half of the selected electrodes, i.e., it must win the voting.

The precision matrix A contains elements $a_{ij} = 1$ if sample i is correctly classified at electrode j (good vote); $a_{ij} = 0$ otherwise (bad vote), for i = 1, …, n and j = 1, …, m.

Chaovalitwongse et al., KDD (2007); Chaovalitwongse et al., Operations Research (forthcoming)
Support Feature Machine

Training (normal and abnormal samples):
- Step 1: For each individual feature (electrode), apply the nearest neighbor rule to every training sample to construct the distance and accuracy matrices (the voting matrix and the distance matrices).
- Step 2: Formulate and solve the SFM models to obtain the optimal feature (electrode) selection (x: electrode selection; y: training accuracy).

Testing (unlabeled samples):
- Step 3: Employ the nearest neighbor rule to classify unlabeled data to the closest baseline (training) data based on the selected features (electrodes).
Support Vector Machine

[Figure: each EEG sample (channels Ch 1, Ch 2, Ch 3, …, Ch n) is arranged into a single data vector; in the feature space (Feature 1, Feature 2, Feature 3), a hyperplane separates pre-seizure from normal samples]
Application in Epilepsy Diagnosis
Facts about Epilepsy

- About 3 million Americans and another 60 million people worldwide (about 1% of the population) suffer from epilepsy.
- Epilepsy is the second most common brain disorder (after stroke); it causes recurrent seizures (not vice versa).
- Seizures usually occur spontaneously, in the absence of external triggers.
- Epileptic seizures occur when a massive group of neurons in the cerebral cortex suddenly begins to discharge in a highly organized rhythmic pattern.
- Seizures cause temporary disturbances of brain functions such as motor control, responsiveness, and recall, which typically last from seconds to a few minutes.
- Based on 1995 estimates, epilepsy imposes an annual economic burden of $12.5 billion* in the U.S. in associated health care costs and losses in employment, wages, and productivity.
- Cost per patient ranges from $4,272 for persons** with remission after initial diagnosis and treatment to $138,602 for persons** with intractable and frequent seizures.

*Begley et al., Epilepsia (2000); **Begley et al., Epilepsia (1994).
Simplified EEG System and Intracranial Electrode Montage

[Figure: intracranial electrode montage showing the electrode arrays ROF, RST, RTD (right) and LOF, LST, LTD (left)]

The electroencephalogram (EEG) is a traditional tool for evaluating the physiological state of the brain by measuring voltage potentials produced by brain cells while communicating.
Scalp EEG Acquisition

[Figure: scalp electrode placement (Fp1, Fp2, F7, F3, Fz, F4, F8, T3, C3, C4, T4, T5, P3, Pz, P4, T6, O1, Oz, O2)]

18 bipolar channels
Goals: How can we help?

- Seizure Prediction
  - Recognizing (data-mining) abnormality patterns in EEG signals preceding seizures
  - Normal versus pre-seizure
  - Alert when pre-seizure samples are detected (online classification)
  - e.g., statistical process control in production systems, attack alerts from sensor data, stock market analysis
- EEG Classification: Routine EEG Check
  - Quickly identify whether the patients have epilepsy
  - Epilepsy versus non-epilepsy
  - Many causes of seizures: convulsive or other seizure-like activity can be non-epileptic in origin and is observed in many other medical conditions. These non-epileptic seizures can be hard to differentiate and may lead to misdiagnosis.
  - e.g., medical check-up, normal and abnormal samples
Normal versus Pre-Seizure
10-Second EEGs: Seizure Evolution

[Figure: 10-second EEG segments from the normal, pre-seizure, seizure onset, and post-seizure states]

Chaovalitwongse et al., Annals of Operations Research (2006)
Normal versus Pre-Seizure Data Set

EEG Dataset Characteristics

Patient ID  Seizure types  Duration of EEG (days)  # of seizures
1           CP, SC         3.55                    7
2           CP, GTC, SC    10.93                   7
3           CP             8.85                    22
4           SC             5.93                    19
5           CP, SC         13.13                   17
6           CP, SC         11.95                   17
7           CP, SC         3.11                    9
8           CP, SC         6.09                    23
9           CP, SC         11.53                   20
10          CP             9.65                    12
Total                      84.71                   153

CP: complex partial; SC: subclinical; GTC: generalized tonic/clonic
Sampling Procedure

- Randomly and uniformly sample 3 EEG epochs per seizure from each of the normal and pre-seizure states.
- For example, Patient 1 has 7 seizures, so 21 normal and 21 pre-seizure EEG epochs are sampled.
- Use leave-one-(seizure)-out cross validation to perform training and testing.

[Figure: along the duration of the EEG, each pre-seizure epoch is drawn from the 30 minutes preceding a seizure; normal epochs are drawn at least 8 hours away from any seizure]
Information/Feature Extraction from EEG Signals

- Measure the brain dynamics from EEG signals.
- Apply dynamical measures (based on chaos theory) to non-overlapping EEG epochs of 10.24 seconds = 2048 points.
- Maximum Short-Term Lyapunov Exponent (STLmax)
  - measures the stability/chaoticity of EEG signals
  - measures the average uncertainty along the local eigenvectors and phase differences of an attractor in the phase space

[Figure: EEG voltage over time]

Pardalos, Chaovalitwongse, et al., Mathematical Programming (2004)
Evaluation

- Sensitivity measures the fraction of positive cases that are classified as positive: Sensitivity = TP / (TP + FN)
- Specificity measures the fraction of negative cases that are classified as negative: Specificity = TN / (TN + FP)
- Type I error = 1 − Specificity
- Type II error = 1 − Sensitivity

Chaovalitwongse et al., Epilepsy Research (2005)
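All four quantities follow directly from the confusion-matrix counts; a minimal sketch:

```python
def evaluate(y_true, y_pred, positive=1):
    """Compute sensitivity and specificity from labels, along with the
    corresponding Type I and Type II error rates."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "type_I": 1 - specificity, "type_II": 1 - sensitivity}

# 3 positives (one missed), 3 negatives (one false alarm).
metrics = evaluate([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
print(metrics)
```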
Leave-One-Seizure-Out Cross Validation

[Figure: with 5 seizures in the recordings, each fold holds out one normal/pre-seizure pair (e.g., N2 and P2) as the testing set, trains the SFM on the remaining pairs, and outputs the selected electrodes (out of 26)]

N: EEGs from the normal state; P: EEGs from the pre-seizure state.
EEG Classification

- Support Vector Machine [Chaovalitwongse et al., Annals of OR (2006)]
  - Project time series data into a high-dimensional (feature) space
  - Generate a hyperplane that separates the two groups of data, minimizing the errors
- Ensemble K-Nearest Neighbor [Chaovalitwongse et al., IEEE SMC: Part A (2007)]
  - Use each electrode as a base classifier
  - Apply the NN rule using statistical time series distances and optimize the value of "k" in training
  - Voting and averaging
- Support Feature Machine [Chaovalitwongse et al., SIGKDD (2007); Chaovalitwongse et al., Operations Research (forthcoming)]
  - Use each electrode as a base classifier
  - Apply the NN rule to the entire baseline data
  - Optimize by selecting the best group of classifiers (electrodes/features)
    - Voting: optimizes the ensemble classification
    - Averaging: uses the concept of inter-class and intra-class distances (or prediction scores)
Performance Characteristics: Upper Bound

[Figure: training (upper-bound) performance of the NN, KNN, and SFM classifiers]

NN: Chaovalitwongse et al., Annals of Operations Research (2006)
SFM: Chaovalitwongse et al., SIGKDD (2007); Chaovalitwongse et al., Operations Research (forthcoming)
KNN: Chaovalitwongse et al., IEEE Trans. Systems, Man, and Cybernetics: Part A (2007)
Separation of Normal and Pre-Seizure EEGs

[Figure: separation from 3 electrodes selected by SFM versus 3 electrodes not selected by SFM]
Performance Characteristics: Validation

[Figure: validation performance of the SVM, KNN, and SFM classifiers]

SVM: Chaovalitwongse et al., Annals of Operations Research (2006)
SFM: Chaovalitwongse et al., SIGKDD (2007); Chaovalitwongse et al., Operations Research (forthcoming)
KNN: Chaovalitwongse et al., IEEE Trans. Systems, Man, and Cybernetics: Part A (2007)
Epilepsy versus Non-Epilepsy
Epilepsy versus Non-Epilepsy Data Set

[Figure: for non-epilepsy and epilepsy patients, recordings from Elec 1 through Elec 18 span 150 points (25 minutes), from which 5 epochs of 30 points each are sampled]

- Routine EEG check: 25-30 minutes of recordings with scalp electrodes
- Each sample is a 5-minute EEG epoch (30 points of STLmax values).
- Each sample is in the form of 18 electrodes × 30 points.
Leave-One-Patient-Out Cross Validation

[Figure: each fold holds out one non-epilepsy/epilepsy pair (e.g., N2 and E2) as the testing set, trains the SFM on the remaining pairs, and outputs the selected electrodes]

N: non-epilepsy; E: epilepsy.
Voting SFM: Validation

[Figure: voting SFM performance, averaged over 10 patients: overall accuracy of KNN (k = 5, 7, 9, 11, all) versus SFM for the DTW, Euclidean (EU), and T-statistic (TS) distances]
Averaging SFM: Validation

[Figure: averaging SFM performance, averaged over 10 patients: overall accuracy of KNN (k = 5, 7, 9, 11, all) versus SFM for the DTW, EU, and TS distances]
Selected Electrodes from Averaging SFM

[Figure: selection percentage of each of the 18 channels for averaging SFM under the DTW, EU, and TS distances; frequently selected channels include Fp1-C3 (channel 1), T6-Oz (16), and Fz-Oz (17)]
Other Medical Diagnosis
Other Medical Datasets

- Breast Cancer
  - Features of cell nuclei (radius, perimeter, smoothness, etc.)
  - Malignant or benign tumors
- Diabetes
  - Patient records (age, body mass index, blood pressure, etc.)
  - Diabetic or not
- Heart Disease
  - General patient info, symptoms (e.g., chest pain), blood tests
  - Identify the presence of heart disease
- Liver Disorders
  - Features of blood tests
  - Detect the presence of liver disorders from excessive alcohol consumption
Performance (classification accuracy, %)

           Training                          Testing
Dataset  LP SVM  NLP SVM  V-SFM  A-SFM    LP SVM  NLP SVM  V-NN   A-NN   V-SFM  A-SFM
WDBC      98.08    96.17  97.28  97.42     97.00    95.38  91.60  93.18  94.99  96.01
HD        85.06    84.66  86.48  86.92     82.96    83.94  80.87  82.77  82.49  84.92
PID       77.66    77.51  75.01  77.96     76.93    76.09  63.14  74.94  72.75  75.83
BLD       65.71    57.97  63.46  66.43     65.71    57.97  38.38  54.09  58.20  59.57
Average Number of Selected Features

Dataset  LP SVM  NLP SVM  V-SFM  A-SFM
WDBC         30       30   11.6    8.5
HD           13       13    7.4    8.7
PID           8        8    4.3    4.5
BLD           6        6    3.3    3.7
Medical Data Signal Processing Apparatus (MeDSPA)

- Quantitative analyses of medical data
  - Neurophysiological data (e.g., EEG, fMRI) acquired during brain diagnosis
- Envisioned as an automated decision-support system configured to accept input medical signal data (associated with a spatial position or feature) and provide measurement data that helps physicians reach a more confident diagnosis.
- Aims to improve current medical diagnosis and prognosis by assisting physicians in
  - recognizing (data-mining) abnormality patterns in medical data
  - recommending the diagnosis outcome (e.g., normal or abnormal)
  - identifying a graphical indication (or feature) of abnormality (localization)
Automated Abnormality Detection Paradigm

[Figure: closed-loop paradigm: data acquisition of multichannel brain activity → optimization (feature extraction/clustering) → statistical analysis (pattern recognition) → interface technology for the nurse and user/patient, which can initiate a warning or a variety of therapies (e.g., electrical stimulation via a stimulator, drug injection)]
Acknowledgement

Collaborators:
- E. Micheli-Tzanakou, PhD
- L.D. Iasemidis, PhD
- R.C. Sachdeo, MD
- R.M. Lehman, MD
- B.Y. Wu, MD, PhD

Students:
- Y.J. Fan, MS
- Other undergraduate students
Thank you for your attention!
Questions?