IE 483/583 Knowledge Discovery and Data Mining Dr. Siggi Olafsson Fall 2003

IE 483/583
Knowledge Discovery and Data Mining
Dr. Siggi Olafsson
Fall 2003
Fall 2004
Data Mining
1
What is Data
Mining?
(… and should I be here?)
Dilbert Replies ...
Some Definitions
“Data mining is the extraction of implicit,
previously unknown, and potentially
useful information from data.”
“Data mining is the process of exploration
and analysis, by automatic or
semiautomatic means, of large quantities
of data in order to discover meaningful
patterns and rules.”
What can Data Mining Do?
• Supervised
– Classification
– Prediction
• Unsupervised
– Association discovery
– Clustering
Applications of Data Mining
• Manufacturing Process Improvement
• Sales and Marketing
• Mapping the Human Genome
• Diagnosing Breast Cancer
• Financial Crime Identification
• Portfolio Management
Technical Background
• Machine Learning
– Data mining: business-oriented use of AI
• Statistics
– Regression, sampling, design of experiments (DOE), etc.
• Decision Support
– Data warehousing, data marts, OLAP, etc
• Interdisciplinary tools put together to form
the process of knowledge discovery in
databases …
Historical Perspective
Timeline (decades shown: < 40, 40s, 50s, 60s, 70s, 80s, 90s), listed in chronological order by field:
• Stat: Bayes theorem, regression, etc.
• AI: Neural networks
• AI: Nearest neighbor, single link, perceptron
• Stat: Resampling, bias reduction, jackknife
• Stat: Linear models for classification, exploratory data analysis (EDA)
• IR: Similarity measures, clustering
• DB: Relational data model
• IR: Smart IR systems
• AI: Genetic algorithms
• Stat: EM algorithm, k-means clustering
• AI: Kohonen maps, decision trees
• DB: Association rule algorithms, web & search engines, data warehousing, OLAP
What Changed?
• Very large databases
• Increased computational power as enabler
• Business perspective
Knowledge Discovery in Databases
[Figure: Databases → Data warehouse → Prepared Data → Model/Structures → Knowledge. The stages are grouped under two labels: “Data Warehouse Systems Engineering” and “Knowledge Discovery and Data Mining”.]
Course Information
• We assume data is ready for mining
• Thus, we focus on:
– models and structures, and
– algorithms
• More information on course homepage
http://www.public.iastate.edu/~olafsson/mining.html
Course Outline
• Introduction
• Exploratory Data Mining
• Supervised Learning
• Unsupervised Learning
• Optimization Methods in Learning
• Selected Advanced Topics
– Mining the Web
– Customer Relationship Management (CRM)
• Course Review
Questions?
Data Mining
• Discover patterns in data
– automatic or semi-automatic process
– meaningful or useful pattern
– large amounts of data
• What does such a pattern look like?
– Black box
– Transparent box
Describing Structural Patterns
• Some ways of representing knowledge:
– Decision tables
– Decision trees
– Classification rules
– Association rules
– Regression trees
– Clusters
The Weather Problem
Outlook   Temp.  Humidity  Windy  Play
Sunny     Hot    High      FALSE  No
Sunny     Hot    High      TRUE   No
Overcast  Hot    High      FALSE  Yes
Rainy     Mild   High      FALSE  Yes
Rainy     Cool   Normal    FALSE  Yes
Rainy     Cool   Normal    TRUE   No
Overcast  Cool   Normal    TRUE   Yes
Sunny     Mild   High      FALSE  No
Sunny     Cool   Normal    FALSE  Yes
Rainy     Mild   Normal    FALSE  Yes
Sunny     Mild   Normal    TRUE   Yes
Overcast  Mild   High      TRUE   Yes
Overcast  Hot    Normal    FALSE  Yes
Rainy     Mild   High      TRUE   No
A Decision List
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
• These are classification rules
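A decision list is evaluated top to bottom, and the first rule whose condition holds decides the class. The following is a minimal sketch of that idea applied to the 14 weather instances; the function name `classify` and the tuple layout are illustrative choices, not part of the course material.

```python
# Weather instances from the slide: (outlook, humidity, windy, play).
# Temperature is omitted because no rule in the decision list uses it.
WEATHER = [
    ("sunny", "high", False, "no"),
    ("sunny", "high", True, "no"),
    ("overcast", "high", False, "yes"),
    ("rainy", "high", False, "yes"),
    ("rainy", "normal", False, "yes"),
    ("rainy", "normal", True, "no"),
    ("overcast", "normal", True, "yes"),
    ("sunny", "high", False, "no"),
    ("sunny", "normal", False, "yes"),
    ("rainy", "normal", False, "yes"),
    ("sunny", "normal", True, "yes"),
    ("overcast", "high", True, "yes"),
    ("overcast", "normal", False, "yes"),
    ("rainy", "high", True, "no"),
]


def classify(outlook, humidity, windy):
    """Apply the decision list in order; the first matching rule wins."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # none of the above


# Collect the instances that the decision list gets wrong.
errors = [row for row in WEATHER if classify(*row[:3]) != row[3]]
print(f"{len(errors)} of {len(WEATHER)} instances misclassified")
```

Because the rules are ordered, reordering them can change the predictions; each rule implicitly assumes that the rules above it did not fire.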
Association Rules
• Many association rules can be inferred:
if temperature = cool then humidity = normal
if humidity = normal and windy = false then play = yes
if outlook = sunny and play = no then humidity = high
Three Layers of the Process
Inputs
Algorithms
Outputs
Inputs
• Three forms
– Concepts
• concept description - what you want to learn
– Instances
• examples - what you learn from
– Attributes
• features of instances - variables you have values for
Concepts: Styles of Learning
• Classification (supervised) learning
• Association learning
• Clustering
• Numeric prediction
Instances: Learn from Examples
• Set of instances to be classified, or
associated, or clustered
• Example of concept to be learned
• Data set: flat file (single relation)
– denormalization
• Family tree example
– concept: sister
– example: family tree
Family Tree
Peter (M) = Peggy (F)
– Children: Steven (M), Graham (M), Pam (F)
Grace (F) = Ray (M)
– Children: Ian (M), Pippa (F), Brian (M)
Pam (F) = Ian (M)
– Children: Anna (F), Nikki (F)
Denormalizing Relational Data
Name    Gender  Parent1  Parent2  Name   Gender  Parent1  Parent2  Sister of?
Steven  Male    Peter    Peggy    Pam    Female  Peter    Peggy    Yes
Ian     Male    Grace    Ray      Pippa  Female  Grace    Ray      Yes
Brian   Male    Grace    Ray      Pippa  Female  Grace    Ray      Yes
Anna    Female  Pam      Ian      Nikki  Female  Pam      Ian      Yes
Nikki   Female  Pam      Ian      Anna   Female  Pam      Ian      Yes
All others                                                         No
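Denormalizing the family data means pairing every person with every other person and recording whether the second is a sister of the first. The sketch below does this for the seven people who appear in the table above. The `PEOPLE` dictionary, the `denormalize` helper, and the sister test (second person is female and has the same two parents) are assumptions made for illustration.

```python
from itertools import permutations

# One row per person: name -> (gender, parent1, parent2).
# Only the people listed in the slide's table are included.
PEOPLE = {
    "Steven": ("male", "Peter", "Peggy"),
    "Pam": ("female", "Peter", "Peggy"),
    "Ian": ("male", "Grace", "Ray"),
    "Pippa": ("female", "Grace", "Ray"),
    "Brian": ("male", "Grace", "Ray"),
    "Anna": ("female", "Pam", "Ian"),
    "Nikki": ("female", "Pam", "Ian"),
}


def denormalize(people):
    """Build one flat row per ordered pair of distinct people."""
    rows = []
    for first, second in permutations(people, 2):
        g1, p1a, p1b = people[first]
        g2, p2a, p2b = people[second]
        # Assumed definition: the second person is a sister of the first
        # if she is female and both share the same pair of parents.
        sister = g2 == "female" and {p1a, p1b} == {p2a, p2b}
        rows.append((first, g1, p1a, p1b, second, g2, p2a, p2b,
                     "yes" if sister else "no"))
    return rows


flat = denormalize(PEOPLE)
positives = [(row[0], row[4]) for row in flat if row[-1] == "yes"]
print(len(PEOPLE), "people become", len(flat), "flat rows")
print("sister-of pairs:", positives)
```

Seven people already produce 42 rows, almost all labeled "no", which hints at the computational and storage cost of denormalization.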
Denormalization Problems
• Computational and storage costs
• Trivial regularities
– Example relations: customers and products; product and supplier; supplier and supplier address
• Infinite relations
Content of Instances: Attributes
• Instance characterized by values of its (predefined) set of attributes
– Numeric (“continuous”)
– Nominal (categorical)
– Ordinal (rank)
– Interval
– Ratio
Focus in this class
Data Preparation
• Data …
– assembly
• set of instances/denormalizing relational data
– integration
• enterprise-wide database/data warehouse
– cleaning
• missing data
– aggregation
• good information
ARFF Format
• Used by the Java package Weka
• Independent, unordered instances
• No relationship between instances
Weather Data
% ARFF file for the weather data with some numeric features
%
@relation weather
@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute humidity numeric
@attribute windy { true, false }
@attribute play? { yes, no }
@data
%
% 14 instances
%
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no
Features
• % = comments
• @relation <name>
• @attribute <name> <type>
– Attribute types: Nominal and numeric
• @data
– List of instances
– Missing values represented by ?
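The features listed above are enough to read a simple ARFF file by hand. The following is a minimal sketch of such a reader covering only comments, `@relation`, `@attribute`, `@data`, and `?` for missing values. The function `parse_arff` is a hypothetical helper, not part of Weka, and it ignores other parts of the format such as quoted names.

```python
def parse_arff(text):
    """Parse a small subset of ARFF: comments, header lines, data rows."""
    relation, attributes, data = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # A lone "?" marks a missing value.
            data.append([None if v == "?" else v for v in values])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, type_spec = line.split(None, 2)
            attributes.append((name, type_spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, data


# Header from the weather slide. The second data row is a made-up
# variation with a missing humidity value, used only to exercise "?".
SAMPLE = """% ARFF file for the weather data
@relation weather
@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute humidity numeric
@attribute windy { true, false }
@attribute play? { yes, no }
@data
sunny, 85, 85, false, no
overcast, 83, ?, false, yes
"""

relation, attributes, data = parse_arff(SAMPLE)
print(relation, [name for name, _ in attributes], data)
```

All values are kept as strings here; converting numeric attributes to numbers would be a natural next step.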
Other Issues
• Missing data
• Inaccurate values
• Look at the data!!!
Recall the Three Layers of the Data Mining Process
Inputs: done
Algorithms
Outputs (structural patterns): next
Describing Structural Patterns
• Ways of representing knowledge:
– Decision tables
– Decision trees
– Classification rules
– Association rules
– Regression trees
– Clusters
The Weather Problem
Outlook   Temp.  Humidity  Windy  Play
Sunny     Hot    High      FALSE  No
Sunny     Hot    High      TRUE   No
Overcast  Hot    High      FALSE  Yes
Rainy     Mild   High      FALSE  Yes
Rainy     Cool   Normal    FALSE  Yes
Rainy     Cool   Normal    TRUE   No
Overcast  Cool   Normal    TRUE   Yes
Sunny     Mild   High      FALSE  No
Sunny     Cool   Normal    FALSE  Yes
Rainy     Mild   Normal    FALSE  Yes
Sunny     Mild   Normal    TRUE   Yes
Overcast  Mild   High      TRUE   Yes
Overcast  Hot    Normal    FALSE  Yes
Rainy     Mild   High      TRUE   No
A Decision List
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
A Decision Tree
Outlook
  Sunny: Humidity
    High: Play=No
  Overcast: Play=Yes
  Rainy: Windy
    TRUE: Play=No
Concepts: Styles of Learning
• Classification (supervised) learning
• Association learning
• Clustering
• Numeric prediction
Classification Rules
• Classification easily read off decision trees
• How?
• Other direction possible, but not as
straightforward
If a and b then x
If c and d then x
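One way to read classification rules off a decision tree is to emit one rule per leaf, with the tests along the path from the root as its conditions. The sketch below does this for the branches visible on the weather decision tree slide. The nested-tuple encoding and the name `tree_to_rules` are illustrative assumptions.

```python
# A node is either a leaf (a string giving the class) or a pair
# (attribute, {value: subtree}). Only the branches visible on the
# decision tree slide are encoded here.
TREE = ("outlook", {
    "sunny": ("humidity", {"high": "play = no"}),
    "overcast": "play = yes",
    "rainy": ("windy", {"true": "play = no"}),
})


def tree_to_rules(node, conditions=()):
    """Return one rule per leaf, built from the path of tests to it."""
    if isinstance(node, str):  # leaf: emit the accumulated conditions
        body = " and ".join(conditions) or "true"
        return [f"If {body} then {node}"]
    attribute, branches = node
    rules = []
    for value, child in branches.items():
        test = f"{attribute} = {value}"
        rules.extend(tree_to_rules(child, conditions + (test,)))
    return rules


for rule in tree_to_rules(TREE):
    print(rule)
```

Rules produced this way can be applied in any order, because each one carries every test on its path. Going from an arbitrary rule set back to a tree has no such direct recipe.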
Corresponding Decision Tree
a
  y: b
    y: x
    n: c
      y: d
        y: x
        n: (no class shown)
      n: (no class shown)
  n: c
    y: d
      y: x
      n: (no class shown)
    n: (no class shown)
Replicated Subtree Problem
X=1
  n: Y=1
    n: b
    y: a
  y: Y=1
    n: a
    y: b

If x=1 and y=0 then a
If x=0 and y=1 then a
If x=0 and y=0 then b
If x=1 and y=1 then b
Replicated Subtree Problem
If x=1 and y=1 then a
If z=1 and w=1 then a
Otherwise b
x,y,z,w take values 1,2,3
Rules with exceptions
If x and y then a
EXCEPT if z then b
• Account for new instances
• Exceptions from exceptions, etc
Association Rules
• Coverage (support): number of instances it
predicts correctly
• Accuracy (confidence): coverage divided by
number of instances it applies to
If temperature = cool
then humidity = normal
• Coverage = 4
• Accuracy = 100%
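The two measures can be checked directly against the 14 weather instances. The following is a minimal sketch; the helper `evaluate_rule` and the dictionary encoding of rules are illustrative assumptions.

```python
COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
WEATHER = [dict(zip(COLUMNS, row)) for row in ROWS]


def matches(instance, conditions):
    """True if the instance has every attribute value in conditions."""
    return all(instance[name] == value for name, value in conditions.items())


def evaluate_rule(instances, antecedent, consequent):
    """Return (coverage, accuracy) as defined on the slide."""
    applies = [x for x in instances if matches(x, antecedent)]
    correct = [x for x in applies if matches(x, consequent)]
    coverage = len(correct)  # instances predicted correctly
    accuracy = coverage / len(applies) if applies else 0.0
    return coverage, accuracy


coverage, accuracy = evaluate_rule(
    WEATHER, {"temperature": "cool"}, {"humidity": "normal"})
print(f"coverage = {coverage}, accuracy = {accuracy:.0%}")
```

The same helper can score any candidate rule, which is how an association rule learner would compare rules against minimum coverage and accuracy thresholds.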
Interpretation
If windy = false and play = no then
outlook = sunny and humidity = high
If windy = false and play = no
then outlook = sunny
If windy = false and play = no
then humidity = high
If humidity = high and windy = false
and play = no
then outlook = sunny
The Shapes Problem
[Figure: shapes; shaded = standing, unshaded = lying]
Instances
Width  Height  Sides  Class
2      4       4      standing
3      6       4      standing
4      3       4      lying
7      8       3      standing
7      6       3      lying
2      9       4      standing
9      1       4      lying
10     2       3      lying
Classification Rules
If width ≥ 3.5 and height < 7.0 then lying
If height ≥ 3.5 then standing
• Work well to classify these instances
• Problems?
Relational Rules
If width > height then lying
If height > width then standing
• Rules comparing attributes to constants are called
propositional rules
• Structural patterns?
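The difference between the two rule styles shows up once they are applied to an unseen shape. The sketch below compares them on the eight instances. It assumes the comparison operators lost from the propositional rules are both ≥ (the reading consistent with all eight instances), and the extra 1 × 2 shape is a made-up test case.

```python
# (width, height, sides, class) for the eight instances on the slide.
SHAPES = [
    (2, 4, 4, "standing"), (3, 6, 4, "standing"),
    (4, 3, 4, "lying"), (7, 8, 3, "standing"),
    (7, 6, 3, "lying"), (2, 9, 4, "standing"),
    (9, 1, 4, "lying"), (10, 2, 3, "lying"),
]


def propositional(width, height):
    """Rules comparing attributes to constants (operators assumed >=)."""
    if width >= 3.5 and height < 7.0:
        return "lying"
    if height >= 3.5:
        return "standing"
    return None  # no rule fires


def relational(width, height):
    """Rules comparing attributes to each other."""
    if width > height:
        return "lying"
    if height > width:
        return "standing"
    return None  # width == height: no rule fires


prop_ok = all(propositional(w, h) == c for w, h, _, c in SHAPES)
rel_ok = all(relational(w, h) == c for w, h, _, c in SHAPES)
print("propositional fits training data:", prop_ok)
print("relational fits training data:", rel_ok)

# Hypothetical new shape, taller than wide but small overall.
print("propositional on 1 x 2:", propositional(1, 2))
print("relational on 1 x 2:", relational(1, 2))
```

Both rule sets fit the training instances, but the thresholds 3.5 and 7.0 are artifacts of this sample, so the propositional rules say nothing about the small shape while the relational rules still classify it.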
CPU Performance Example
      Cycle      Main memory (KB)   Cache   Channels        Performance
      time (ns)  min      max               min     max
      MYCT       MMIN     MMAX      CACH    CHMIN   CHMAX   PRP
1     125        256      6000      256     16      128     198
2     29         8000     32000     32      8       32      269
3     29         8000     32000     32      8       32      220
4     29         8000     32000     32      8       32      172
5     29         8000     16000     32      8       16      132
…
207   125        2000     8000      0       2       14      52
208   480        512      8000      32      0       0       67
209   480        1000     4000      0       0       0       45
Numerical Prediction: regression equation
PRP = -56.1
      + 0.049 MYCT
      + 0.015 MMIN
      + 0.006 MMAX
      + 0.630 CACH
      - 0.270 CHMIN
      + 1.46 CHMAX
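A regression equation turns an instance into a numeric prediction by a weighted sum of its attribute values. The sketch below evaluates the equation on two rows of the CPU table. The coefficient magnitudes come from the slide; their signs did not survive in the slide text and are assumed here (negative intercept and negative CHMIN weight, all others positive).

```python
# Coefficient magnitudes from the slide; the signs are an assumption.
INTERCEPT = -56.1
COEFFICIENTS = {
    "MYCT": 0.049,
    "MMIN": 0.015,
    "MMAX": 0.006,
    "CACH": 0.630,
    "CHMIN": -0.270,
    "CHMAX": 1.46,
}


def predict_prp(machine):
    """Weighted sum of the attribute values plus the intercept."""
    return INTERCEPT + sum(
        weight * machine[name] for name, weight in COEFFICIENTS.items())


# Rows 1 and 209 of the CPU performance table, with the observed PRP.
ROW_1 = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
         "CACH": 256, "CHMIN": 16, "CHMAX": 128, "PRP": 198}
ROW_209 = {"MYCT": 480, "MMIN": 1000, "MMAX": 4000,
           "CACH": 0, "CHMIN": 0, "CHMAX": 0, "PRP": 45}

for label, row in (("row 1", ROW_1), ("row 209", ROW_209)):
    print(f"{label}: predicted {predict_prp(row):.1f}, observed {row['PRP']}")
```

Comparing predicted and observed values row by row is the simplest way to see how well a single global linear model fits, which motivates the tree-based alternatives that follow.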
Regression Tree
[Figure: regression tree. Splits on CHMIN (≤ 7.5, > 7.5), CACH (≤ 8.5, (8.5, 28], > 28), and MMAX; one leaf shows the predicted value 64.6.]
- Accuracy?
- Large and possibly awkward
Model Trees
[Figure: model tree. Splits on CHMIN (≤ 7.5, > 7.5), CACH (≤ 8.5, > 8.5), and MMAX (≤ 28000, > 28000); leaves hold linear models, including LM4, LM5, and LM6.]
LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
LM2: PRP = …
Instance-Based Representation
• Store actual instances
• New instance: algorithm finds “most
similar” stored instance
• Features
– What is a similar instance?
– Need to store (all?) instances
– Really a black box method
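Instance-based learning can be sketched in a few lines: keep the training instances and, for a new instance, return the class of the closest stored one. The example below uses the width and height of the shapes instances. Euclidean distance is an assumed answer to "What is a similar instance?", and the query points are made up.

```python
import math

# Stored instances: ((width, height), class) from the shapes example.
STORED = [
    ((2, 4), "standing"), ((3, 6), "standing"),
    ((4, 3), "lying"), ((7, 8), "standing"),
    ((7, 6), "lying"), ((2, 9), "standing"),
    ((9, 1), "lying"), ((10, 2), "lying"),
]


def nearest_neighbor(query, stored):
    """Return the stored (point, class) pair closest to the query."""
    # Assumption: similarity is measured by Euclidean distance.
    return min(stored, key=lambda item: math.dist(query, item[0]))


# Hypothetical new instances to classify.
for query in [(8, 2), (2, 8)]:
    point, label = nearest_neighbor(query, STORED)
    print(f"{query} is closest to {point}, so predict {label}")
```

No explicit pattern is ever produced: the "model" is the stored data plus the distance function, which is why the method behaves like a black box.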
Clusters:
[Figure: two cluster diagrams, each over the instances a through k]
Next: Algorithms