Data mining and its applications in medicine

Data mining and its application and usage in medicine
By Radhika
Data Mining and Medicine

History
 Past 20 years with relational databases
 More dimensions to database queries
 Earliest and most successful area of data mining: mid-1800s, London hit by an infectious disease
 Two theories
– Miasma theory: bad air propagated the disease
– Germ theory: the disease was water-borne
 Advantages
– Discover trends even when we don't understand the reasons
– Discover irrelevant patterns that confuse rather than enlighten
– Protection against unaided human inference of patterns: provide quantifiable measures and aid human judgment

Data Mining
 Patterns that are persistent and meaningful
 Knowledge Discovery of Data (KDD)
The future of data mining
 The 10 biggest killers in the US
 Data mining = the process of discovering interesting, meaningful and actionable patterns hidden in large amounts of data
Major Issues in Medical Data Mining
 Heterogeneity of medical data
– Volume and complexity
– Physician's interpretation
– Poor mathematical categorization
– Canonical form
– Solution: standard vocabularies, interfaces between different data sources for integration, design of electronic patient records
 Ethical, legal and social issues
– Data ownership
– Lawsuits
– Privacy and security of human data
– Expected benefits
– Administrative issues
Why Data Preprocessing?
 Patient records consist of clinical and lab parameters and results of particular investigations, specific to tasks
 Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
 Noisy: containing errors or outliers
 Inconsistent: containing discrepancies in codes or names
 Temporal: chronic disease parameters recorded over time
 No quality data, no quality mining results!
 A data warehouse needs consistent integration of quality data
 In the medical domain, handling incomplete, inconsistent or noisy data needs people with domain knowledge
What is Data Mining? The KDD Process
(Figure: the KDD process) Databases → data cleaning and integration → data warehouse → selection of task-relevant data → data mining → pattern evaluation
From Tables and Spreadsheets to Data Cubes
 A data warehouse is based on a multidimensional data model that views data in the form of a data cube
 A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
– Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year)
– Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
 W. H. Inmon: "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process."
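To make the fact/dimension idea concrete, here is a minimal pandas sketch; the sales fact table, its columns and its numbers are invented for illustration and are not from the deck.

```python
import pandas as pd

# Hypothetical fact table: one row per sale, with dimension attributes
# (item_name, brand, quarter) and a single measure (dollars_sold).
fact = pd.DataFrame({
    "item_name":    ["monitor", "monitor", "laptop", "laptop", "laptop"],
    "brand":        ["Acme",    "Acme",    "Zenith", "Zenith", "Zenith"],
    "quarter":      ["Q1",      "Q2",      "Q1",     "Q1",     "Q2"],
    "dollars_sold": [200,        250,       900,      1100,     950],
})

# A 2-D slice of the cube: item x time, with the measure aggregated.
cube = fact.pivot_table(index="item_name", columns="quarter",
                        values="dollars_sold", aggfunc="sum")
print(cube)
```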
Data Warehouse vs. Heterogeneous DBMS
 Data warehouse: update-driven, high performance
– Information from heterogeneous sources is integrated in advance and stored in the warehouse for direct query and analysis
– Does not contain the most current information
– Query processing does not interfere with processing at local sources
– Stores and integrates historical information
– Supports complex multidimensional queries
Data Warehouse vs. Operational DBMS
 OLTP (on-line transaction processing)
– Major task of traditional relational DBMS
– Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
 OLAP (on-line analytical processing)
– Major task of a data warehouse system
– Data analysis and decision making
 Distinct features (OLTP vs. OLAP):
– User and system orientation: customer vs. market
– Data contents: current, detailed vs. historical, consolidated
– Database design: ER + application vs. star + subject
– View: current, local vs. evolutionary, integrated
– Access patterns: update vs. read-only but complex queries
Why Separate Data Warehouse?
 High performance for both systems
– DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
– Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
 Different functions and different data:
– Missing data: decision support requires historical data which operational DBs do not typically maintain
– Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
– Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
Typical OLAP Operations
 Roll up (drill-up): summarize data
– by climbing up a hierarchy or by dimension reduction
 Drill down (roll down): reverse of roll-up
– from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
 Slice and dice:
– project and select
 Pivot (rotate):
– reorient the cube; visualization; 3D to a series of 2D planes
 Other operations
– drill across: involving (across) more than one fact table
– drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
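As a loose illustration (not tied to any particular OLAP server), the pandas sketch below mimics roll-up, drill-down and slicing on a tiny cube; the dimensions (region, quarter, month) and the sales measure are made up for the example.

```python
import pandas as pd

# Hypothetical detailed data: region x quarter x month with a sales measure.
df = pd.DataFrame({
    "region":  ["East", "East", "East", "West", "West", "West"],
    "quarter": ["Q1",   "Q1",   "Q2",   "Q1",   "Q2",   "Q2"],
    "month":   ["Jan",  "Feb",  "Apr",  "Jan",  "Apr",  "May"],
    "sales":   [100,    120,    90,     80,     130,    70],
})

# Roll up: climb the time hierarchy from month to quarter.
rollup = df.groupby(["region", "quarter"])["sales"].sum()

# Drill down: back to the more detailed month level.
drilldown = df.groupby(["region", "quarter", "month"])["sales"].sum()

# Slice: fix one dimension (quarter = Q1) and look at the rest.
slice_q1 = df[df["quarter"] == "Q1"]

print(rollup, drilldown, slice_q1, sep="\n\n")
```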
Multi-Tiered Architecture
(Figure: three-tier data warehouse architecture)
 Data sources: operational DBs and other sources, feeding an extract / transform / load / refresh layer with a monitor & integrator
 Data storage: the data warehouse, data marts and metadata
 OLAP engine: OLAP server
 Front-end tools: analysis, query, reports, data mining
Steps of a KDD Process
 Learning the application domain:
– relevant prior knowledge and goals of the application
 Creating a target data set: data selection
 Data cleaning and preprocessing (may take 60% of the effort!)
 Data reduction and transformation:
– find useful features, dimensionality/variable reduction, invariant representation
 Choosing the functions of data mining:
– summarization, classification, regression, association, clustering
 Choosing the mining algorithm(s)
 Data mining: search for patterns of interest
 Pattern evaluation and knowledge presentation:
– visualization, transformation, removing redundant patterns, etc.
 Use of discovered knowledge
Common Techniques in Data Mining
 Predictive data mining (most important)
– Classification: relate one set of variables in the data to response variables
– Regression: estimate some continuous value
 Descriptive data mining
– Clustering: discovering groups of similar instances
– Association rule extraction over variables/observations
– Summarization of group descriptions
Leukemia
 Different types of cells look very similar
 Given a number of samples (patients):
– Can we diagnose the disease accurately?
– Predict the outcome of treatment?
– Recommend the best treatment based on previous treatments?
 Solution: data mining on microarray data
– 38 training patients, 34 testing patients, ~7,000 patient attributes
– 2 classes: Acute Lymphoblastic Leukemia (ALL) vs. Acute Myeloid Leukemia (AML)
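A minimal sketch of how such a two-class prediction could be set up; random values stand in for the real ALL/AML microarray data, which is not reproduced on the slide, and the nearest-neighbor classifier is just one convenient choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 72 patients x 7000 expression values, labelled ALL (0) or AML (1).
X = rng.normal(size=(72, 7000))
y = rng.integers(0, 2, size=72)

X_train, y_train = X[:38], y[:38]   # 38 training patients
X_test, y_test = X[38:], y[38:]     # 34 testing patients

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```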
Clustering / Instance-Based Learning
 Uses specific instances to perform classification rather than general IF–THEN rules
 Nearest-neighbor classifier
– Among the most studied algorithms for medical purposes
 Clustering: partitioning a data set into several groups (clusters) such that
– Homogeneity: objects belonging to the same cluster are similar to each other
– Separation: objects belonging to different clusters are dissimilar to each other
 Three elements
– The set of objects
– The set of attributes
– The distance measure
Measure the Dissimilarity of Objects
 Find the best matching instance
 Distance function
– Measures the dissimilarity between a pair of data objects
 Things to consider
– Usually very different for interval-scaled, boolean, nominal, ordinal and ratio-scaled variables
– Weights should be associated with different variables based on the application and data semantics
 The quality of a clustering result depends on both the distance measure adopted and its implementation
Minkowski Distance
 Minkowski distance: a generalization

d(i, j) = ( |x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q )^(1/q),  q > 0

 If q = 2, d is the Euclidean distance
 If q = 1, d is the Manhattan distance

(Figure: for x_i = (1, 7) and x_j = (7, 1), the Manhattan distance (q = 1) is 6 + 6 = 12 and the Euclidean distance (q = 2) is ≈ 8.48)
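A quick check of the figure's numbers, a small sketch using only the two points on the slide:

```python
def minkowski(x, y, q):
    """Minkowski distance between two equal-length vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

xi, xj = (1, 7), (7, 1)
print(minkowski(xi, xj, 1))  # Manhattan distance: 12.0
print(minkowski(xi, xj, 2))  # Euclidean distance: ~8.49 (the slide rounds to 8.48)
```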
Binary Variables
 A contingency table for binary data (object i vs. object j):

                 Object j
                 1      0      sum
Object i   1     a      b      a+b
           0     c      d      c+d
          sum    a+c    b+d    p

 Simple matching coefficient:

d(i, j) = (b + c) / (a + b + c + d)
Dissimilarity between Binary Variables
 Example:

           A1  A2  A3  A4  A5  A6  A7
Object 1    1   0   1   1   1   0   0
Object 2    1   1   1   0   0   0   1

 Contingency table:

                 Object 2
                 1    0    sum
Object 1   1     2    2    4
           0     2    1    3
          sum    4    3    7

d(O1, O2) = (2 + 2) / (2 + 2 + 2 + 1) = 4/7
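A small sketch reproducing this computation for the two objects on the slide:

```python
o1 = [1, 0, 1, 1, 1, 0, 0]
o2 = [1, 1, 1, 0, 0, 0, 1]

# Contingency counts for two binary objects.
a = sum(1 for x, y in zip(o1, o2) if x == 1 and y == 1)
b = sum(1 for x, y in zip(o1, o2) if x == 1 and y == 0)
c = sum(1 for x, y in zip(o1, o2) if x == 0 and y == 1)
d = sum(1 for x, y in zip(o1, o2) if x == 0 and y == 0)

# Simple matching coefficient (dissimilarity): mismatches over all attributes.
print((b + c) / (a + b + c + d))  # 4/7 ~= 0.571
```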
k-Means Algorithm
 Initialization
– Arbitrarily choose k objects as the initial cluster centers (centroids)
 Iterate until no change
– For each object Oi
 Calculate the distances between Oi and the k centroids
 (Re)assign Oi to the cluster whose centroid is closest to Oi
– Update the cluster centroids based on the current assignment
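A minimal sketch of this loop; scikit-learn's KMeans is used purely as one convenient implementation, and the toy points are made up.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D objects; real use would take patient attribute vectors.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# k = 2 clusters: choose initial centroids, reassign points to the nearest
# centroid, update the centroids, repeat until the assignment stops changing.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment of each object
print(km.cluster_centers_)  # final centroids
```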
k-Means Clustering Method
(Figure: three scatter plots of one k-means iteration: the current clusters with their cluster means, the objects relocated to the nearest mean, and the resulting new clusters)
Dataset
 Data set from the UCI repository: http://kdd.ics.uci.edu/
 768 female Pima Indians evaluated for diabetes
 After data cleaning, 392 data entries remain
Hierarchical Clustering
 Groups observations based on dissimilarity
 Compacts the database into "labels" that represent the observations
 Measures of similarity/dissimilarity
– Euclidean distance
– Manhattan distance
 Types of clustering (linkage)
– Single link
– Average link
– Complete link
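A small sketch of single-, complete- and average-link clustering with SciPy; random values stand in for the cleaned Pima Indian records the deck actually uses.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))          # stand-in observations

dists = pdist(X, metric="euclidean")  # pairwise dissimilarities
for method in ("single", "complete", "average"):
    Z = linkage(dists, method=method) # hierarchical merge tree
    print(method, "first merge at height", Z[0, 2])

# scipy.cluster.hierarchy.dendrogram(Z) would plot the tree for a linkage Z.
```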
Hierarchical Clustering: Comparison
(Figure: the same six points clustered four ways, single-link, complete-link, average-link and centroid distance, showing how the resulting clusters differ)
Compare Dendrograms
(Figure: dendrograms for the six points under single-link, complete-link, average-link and centroid distance)
Which Distance Measure is Better?
 Each method has both advantages and disadvantages; the choice is application-dependent
 Single-link
– Can find irregularly shaped clusters
– Sensitive to outliers
 Complete-link, average-link and centroid distance
– Robust to outliers
– Tend to break large clusters
– Prefer spherical clusters
Dendrogram from dataset
 Single link: a minimum spanning tree through the observations
 The single observation that is last to join the cluster is a patient whose blood pressure and skin thickness are in the bottom quartile and whose BMI is in the bottom half
 Her insulin, however, was the largest, and she is a 59-year-old diabetic
Dendrogram from dataset
 Complete link: maximum dissimilarity between observations in one cluster when compared to another
Dendrogram from dataset
 Average link: average dissimilarity between observations in one cluster when compared to another
Supervised versus Unsupervised Learning
 Supervised learning (classification)
– Supervision: training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
– New data is classified based on the training set
 Unsupervised learning (clustering)
– Class labels of the training data are unknown
– Given a set of measurements, observations, etc., the task is to establish the existence of classes or clusters in the data
Classification and Prediction
 Derive models that can use patient-specific information to aid clinical decision making
 A priori decision on the predictors and the variables to predict
 No method can find predictors that are not present in the data
 Numeric response
– Least squares regression
 Categorical response
– Classification trees
– Neural networks
– Support vector machines
 Decision models
– Prognosis, diagnosis and treatment planning
– Embedded in clinical information systems
Least Squares Regression
 Find a linear function of the predictor variables that minimizes the sum of squared differences with the response
 A supervised learning technique
 Predict insulin in our dataset from glucose and BMI
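A minimal sketch of that regression, with random stand-in values where the actual Pima Indian measurements would go:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Stand-in predictors (glucose, BMI) and response (insulin) for 392 records.
X = rng.normal(loc=[120, 32], scale=[30, 6], size=(392, 2))
y = 2.5 * X[:, 0] + 3.0 * X[:, 1] + rng.normal(scale=20, size=392)

model = LinearRegression().fit(X, y)   # least squares fit
print(model.coef_, model.intercept_)   # linear function of glucose and BMI
print(model.predict([[140, 35]]))      # predicted insulin for one patient
```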
Decision Trees
 Decision tree
– Each internal node tests an attribute
– Each branch corresponds to an attribute value
– Each leaf node assigns a classification
 ID3 algorithm
– Uses training objects with known class labels to classify testing objects
– Ranks attributes with the information gain measure
– Minimal height: the least number of tests needed to classify an object
– Used in commercial tools, e.g. Clementine
 ASSISTANT
– Deals with medical datasets
– Handles incomplete data
– Discretizes continuous variables
– Prunes unreliable parts of the tree
– Classifies data
Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
– Attributes are categorical (if continuous-valued, they are discretized in advance)
– The tree is constructed in a top-down, recursive, divide-and-conquer manner
– At the start, all training examples are at the root
– Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
– Examples are partitioned recursively based on the selected attributes
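As a rough illustration only: scikit-learn's tree learner uses binary splits on one-hot-encoded attributes rather than ID3's multiway splits, so this is a stand-in for the algorithm above, shown on a few rows shaped like the training table on the next slide.

```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# A few categorical rows (age, bmi, hereditary, vision) and class labels;
# the full 14-patient table appears on the next slide.
X = [["<=30", "high", "no", "fair"],
     ["<=30", "high", "no", "excellent"],
     ["31-40", "high", "no", "fair"],
     [">40", "medium", "no", "fair"]]
y = ["no", "no", "yes", "yes"]

enc = OneHotEncoder().fit(X)
tree = DecisionTreeClassifier(criterion="entropy").fit(enc.transform(X), y)
print(export_text(tree, feature_names=list(enc.get_feature_names_out())))
```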
Training Dataset
Patient  Age     BMI     Hereditary  Vision     Risk of Condition X
P1       <=30    high    no          fair       no
P2       <=30    high    no          excellent  no
P3       31…40   high    no          fair       yes
P4       >40     medium  no          fair       yes
P5       >40     low     yes         fair       yes
P6       >40     low     yes         excellent  no
P7       31…40   low     yes         excellent  yes
P8       <=30    medium  no          fair       no
P9       <=30    low     yes         fair       yes
P10      >40     medium  yes         fair       yes
P11      <=30    medium  yes         excellent  yes
P12      31…40   medium  no          excellent  yes
P13      31…40   high    yes         fair       yes
P14      >40     medium  no          excellent  no
Construction of A Decision Tree for “Condition X”
All 14 patients [P1, …, P14]: Yes: 9, No: 5 → split on Age

 Age <= 30: [P1, P2, P8, P9, P11] (Yes: 2, No: 3) → split on Hereditary
– Hereditary = no: [P1, P2, P8] (Yes: 0, No: 3) → NO
– Hereditary = yes: [P9, P11] (Yes: 2, No: 0) → YES
 Age 31…40: [P3, P7, P12, P13] (Yes: 4, No: 0) → YES
 Age > 40: [P4, P5, P6, P10, P14] (Yes: 3, No: 2) → split on Vision
– Vision = excellent: [P6, P14] (Yes: 0, No: 2) → NO
– Vision = fair: [P4, P5, P10] (Yes: 3, No: 0) → YES
Entropy and Information Gain
 S contains s_i tuples of class C_i, for i = 1, ..., m
 Information (entropy) required to classify an arbitrary tuple:

I(s_1, s_2, ..., s_m) = - Σ_{i=1..m} (s_i / s) log_2(s_i / s)

 Entropy of attribute A with values {a_1, a_2, ..., a_v}:

E(A) = Σ_{j=1..v} ((s_1j + ... + s_mj) / s) · I(s_1j, ..., s_mj)

 Information gained by branching on attribute A:

Gain(A) = I(s_1, s_2, ..., s_m) - E(A)
Entropy and Information Gain
 Select the attribute with the highest information gain (or greatest entropy reduction)
– Such an attribute minimizes the information needed to classify the samples
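A small sketch computing these quantities for the 14-patient training table above; it should show Age with the highest gain, matching the tree construction.

```python
from collections import Counter
from math import log2

# (age, bmi, hereditary, vision, risk) rows from the training dataset slide.
data = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31-40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31-40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31-40", "medium", "no", "excellent", "yes"),
    ("31-40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
attrs = ["age", "bmi", "hereditary", "vision"]

def info(labels):
    """I(s1,...,sm): entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

labels = [row[-1] for row in data]
for i, name in enumerate(attrs):
    e = 0.0
    for v in {row[i] for row in data}:
        part = [row[-1] for row in data if row[i] == v]
        e += len(part) / len(data) * info(part)            # E(A)
    print(name, "gain =", round(info(labels) - e, 3))      # Gain(A) = I - E(A)
```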
Rule Induction
 IF conditions THEN conclusion (e.g., CN2)
 Concept description:
– Characterization: provides a concise and succinct summarization of a given collection of data
– Comparison: provides descriptions comparing two or more collections of data
 Training set, testing set
 Rules may be imprecise
 Predictive accuracy
– P / (P + N)
Example used in a Clinic
 Hip arthroplasty: the trauma surgeon predicts the patient's long-term clinical status after surgery
 Outcome evaluated during follow-ups for 2 years
 2 modeling techniques
– Naïve Bayesian classifier
– Decision trees
 Bayesian classifier
– P(outcome=good) = 11/20 = 0.55
– The probability gets updated as more attributes are considered
– P(timing=good | outcome=good) = 9/11 ≈ 0.82
– P(outcome=bad) = 9/20
– P(timing=good | outcome=bad) = 5/9
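A sketch of the naïve Bayes update using just the numbers on this slide (one attribute, timing = good; further attributes would multiply in the same way):

```python
# Priors and likelihoods taken from the slide.
p_good, p_bad = 11/20, 9/20
p_timing_given_good = 9/11
p_timing_given_bad = 5/9

# Unnormalized posteriors after observing timing = good.
good = p_good * p_timing_given_good   # 0.45
bad = p_bad * p_timing_given_bad      # 0.25

# Normalize so the two posteriors sum to 1.
print("P(outcome=good | timing=good) =", round(good / (good + bad), 3))  # ~0.643
print("P(outcome=bad  | timing=good) =", round(bad / (good + bad), 3))   # ~0.357
```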
Nomogram (figure)
Bayesian Classification
 Bayesian classifier vs. decision tree
– Decision tree: predicts the class label
– Bayesian classifier: a statistical classifier; predicts class membership probabilities
 Based on Bayes' theorem; estimates the posterior probability
 Naïve Bayesian classifier:
– Simple classifier that assumes attribute independence
– High speed when applied to large databases
– Comparable in performance to decision trees
Bayes Theorem
 Let X be a data sample whose class label is unknown
 Let H_i be the hypothesis that X belongs to a particular class C_i
 P(H_i) is the prior probability that X belongs to class C_i
– Can be estimated by n_i / n from the training data samples
– n is the total number of training data samples
– n_i is the number of training data samples of class C_i
 Bayes theorem:

P(H_i | X) = P(X | H_i) · P(H_i) / P(X)
More Classification Techniques
 Neural networks
– Similar to the pattern-recognition properties of biological systems
– Most frequently used
– Multi-layer perceptrons: input (with bias) connected by weights to hidden and output layers
– Backpropagation neural networks
 Support vector machines
– Separate the database into mutually exclusive regions
– Transform to another problem space
– Kernel functions (dot product)
– Output for new points predicted by their position
 Comparison with classification trees
– Not possible to know which features or combinations of features most influence a prediction
Multilayer Perceptrons
 Non-linear transfer functions applied to weighted sums of inputs
 Werbos algorithm
– Start from random weights
– Training set, testing set
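A rough sketch of a small multilayer perceptron on stand-in data; scikit-learn's MLPClassifier is used for convenience and trains the weights by backpropagation.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                 # stand-in patient attributes
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)  # stand-in outcome labels

# One hidden layer of 8 units with a non-linear transfer function;
# weights start random and are fitted by backpropagation.
mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="logistic",
                    max_iter=2000, random_state=0).fit(X, y)
print("training accuracy:", mlp.score(X, y))
```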
Support Vector Machines
 3 steps
– Support vector creation
– Maximal distance between the points is found
– A perpendicular decision boundary is placed
 Allows some points to be misclassified
 Pima Indian data with X1 (glucose) and X2 (BMI)
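A minimal sketch of such a two-feature SVM; random values stand in for the glucose and BMI columns, since the Pima data itself is not included here.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(120, 30, 392),   # stand-in glucose
                     rng.normal(32, 6, 392)])    # stand-in BMI
y = (X[:, 0] + 2 * X[:, 1] > 185).astype(int)    # stand-in diabetes label

# Soft-margin SVM: C controls how many points may be misclassified;
# the kernel works with dot products in a transformed problem space.
svm = SVC(kernel="rbf", C=1.0).fit(X, y)
print(svm.predict([[140, 35]]))                  # class of a new point by its position
```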
What is Association Rule Mining?
 Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories

PatientID  Conditions
1          High LDL, Low HDL, High BMI, Heart Failure
2          High LDL, Low HDL, Heart Failure, Diabetes
3          Diabetes
4          High LDL, Low HDL, Heart Failure
5          High BMI, High LDL, Low HDL, Heart Failure

 Example of an association rule:
{High LDL, Low HDL} → {Heart Failure}
– People who have high LDL ("bad" cholesterol) and low HDL ("good" cholesterol) are at higher risk of heart failure
Association Rule Mining
 Market basket analysis
– Groups of items bought together are placed together
– Healthcare: understanding associations among patients with demands for similar treatments and services
 Goal: find items for which the joint probability of occurrence is high
 A basket of binary-valued variables
 Results form association rules, augmented with support and confidence
Association Rule Mining
 Association rule
– An implication expression of the form X → Y, where X and Y are itemsets and X ∩ Y = ∅
 Rule evaluation metrics
– Support (s): the fraction of transactions that contain both X and Y

s = P(X ∪ Y) = (# transactions containing X ∪ Y) / (# transactions in D)

– Confidence (c): measures how often items in Y appear in transactions that contain X

c = P(Y | X) = (# transactions containing X ∪ Y) / (# transactions containing X)

(Figure: Venn diagram of the transactions in D containing X, containing Y, and containing both)
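A small sketch checking the example rule against the five-patient table above:

```python
patients = {
    1: {"High LDL", "Low HDL", "High BMI", "Heart Failure"},
    2: {"High LDL", "Low HDL", "Heart Failure", "Diabetes"},
    3: {"Diabetes"},
    4: {"High LDL", "Low HDL", "Heart Failure"},
    5: {"High BMI", "High LDL", "Low HDL", "Heart Failure"},
}
X = {"High LDL", "Low HDL"}
Y = {"Heart Failure"}

both = sum(1 for conds in patients.values() if X | Y <= conds)
has_x = sum(1 for conds in patients.values() if X <= conds)

print("support =", both / len(patients))   # 4/5 = 0.8
print("confidence =", both / has_x)        # 4/4 = 1.0
```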
The Apriori Algorithm
 Starts with the most frequent 1-itemsets
 Includes only those "items" that pass the support threshold
 Uses the 1-itemsets to generate 2-itemsets, and so on
 Stops when the threshold is not satisfied by any itemset

Pseudo-code:
L1 = {frequent items}
for (k = 1; Lk != ∅; k++) do
    Candidate generation: Ck+1 = candidates generated from Lk
    Candidate counting: for each transaction t in the database,
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_sup
return ∪k Lk
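A compact Python sketch of the same level-wise loop (it skips Apriori's subset-based candidate pruning, but reproduces the trace on the next slide):

```python
def apriori(transactions, min_sup):
    """Return all frequent itemsets (as frozensets) with support >= min_sup."""
    n = len(transactions)
    support = lambda items: sum(items <= t for t in transactions) / n

    # L1: frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
    frequent = list(level)

    while level:
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets.
        k = len(level[0]) + 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Candidate counting and pruning by min_sup.
        level = [c for c in candidates if support(c) >= min_sup]
        frequent.extend(level)
    return frequent

db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
for itemset in apriori(db, min_sup=0.5):
    print(sorted(itemset))
```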
Apriori-based Mining
Database D (min_sup = 0.5):

TID  Items
10   a, c, d
20   b, c, e
30   a, b, c, e
40   b, e

1-candidates (scan D): a:2, b:3, c:3, d:1, e:3
Frequent 1-itemsets: a:2, b:3, c:3, e:3

2-candidates: ab, ac, ae, bc, be, ce
Counting (scan D): ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Frequent 2-itemsets: ac:2, bc:2, be:3, ce:2

3-candidates (scan D): bce
Frequent 3-itemsets: bce:2
Principal Component Analysis
 Principal components
– With a large number of variables, it is highly likely that some subsets of the variables are strongly correlated with each other; reduce the variables while retaining the variability in the dataset
– Linear combinations of the variables in the database
– The variance of each PC is maximized: display as much of the spread of the original data as possible
– PCs are orthogonal to each other: minimize the overlap between the variables
– Each component's normalized sum of squares is unity: easier mathematical analysis
 Number of PCs < number of variables
– Associations are found
– A small number of PCs explain a large amount of the variance
 Example: 768 female Pima Indians evaluated for diabetes
– Number of times pregnant, two-hour oral glucose tolerance test (OGTT) plasma glucose, diastolic blood pressure, triceps skin fold thickness, two-hour serum insulin, BMI, diabetes pedigree function, age, diabetes onset within the last 5 years
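A short sketch of PCA on stand-in data with the same shape as the Pima table (8 numeric predictors, 768 rows); the real measurements are not reproduced here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))          # stand-in for the 8 numeric predictors

# Standardize, then find orthogonal linear combinations with maximal variance.
Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=3).fit(Z)

print(pca.explained_variance_ratio_)   # variance explained by each PC
print(pca.components_[0])              # loadings of the first PC
```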
PCA Example (figure)
National Cancer Institute
 CancerNet: http://www.nci.nih.gov
 CancerNet for Patients and the Public
 CancerNet for Health Professionals
 CancerNet for Basic Researchers
 CancerLit
Conclusion
 About three-quarters of a billion people's medical records are electronically available
 Data mining in medicine is distinct from other fields due to the nature of the data: heterogeneous, with ethical, legal and social constraints
 The most commonly used technique is classification and prediction, with different techniques applied in different cases
 Association rules describe the data in the database
 Medical data mining can be the most rewarding, despite the difficulty
Thank you !!!