Data Mining Demystified ver2

Data Mining Demystified
John Aleshunas
Fall Faculty Institute
October 2006
Prediction is very hard, especially when it's about
the future.
- Yogi Berra
Data Mining Stories



“My bank called and said that they saw that I
bought two surfboards at Laguna Beach,
California.” - credit card fraud detection
The NSA is using data mining to analyze
telephone call data to track al-Qaeda activities
Victoria’s Secret uses data mining to control
product distribution based on typical customer
buying patterns at individual stores
Preview

Why data mining?

Example data sets

Data mining methods

Example application of data mining

Social issues of data mining
Why Data Mining?

Database systems have been around since the
1970s

Organizations have a vast digital history of the
day-to-day pieces of their processes

Simple queries no longer provide satisfying
results


They take too long to execute
They cannot help us find new opportunities
Source: Han
Why Data Mining?



Data doubles about every year while
useful information seems to be decreasing
Vast data stores overload traditional
decision making processes
We are data rich, but information poor
Source: Han
Data Mining: a definition
Simply stated, data mining refers to the
extraction of knowledge from large
amounts of data.
Data Mining Models
A Taxonomy
Data Mining
    Predictive
        Classification
        Regression
        Time Series Analysis
        Prediction
    Descriptive
        Clustering
        Summarization
        Association Rules
        Sequence Discovery
Source: Dunham
Example Datasets

Iris

Wine

Diabetes
Iris Dataset

Created by R.A. Fisher (1936)

150 instances

Three species (Setosa, Virginica, Versicolor), 50
instances each

4 measurements (petal width, petal length, sepal width,
sepal length)

One species (Setosa) is easily separable; the other two
are not (noisy data)
Source: Fisher
Iris Dataset Analysis
[Figure 2: Sepal width (cm) by record number for Iris-Setosa, Iris-Versicolor, and Iris-Virginica]
Wine Dataset


This data is the result of a chemical
analysis of wines grown in the same
region in Italy but derived from three
different varieties.
153 instances with 13 constituents found
in each of the three types of wines.
Source: UCI Machine Learning
Repository
Wine Dataset Analysis
[Figure: Flavanoids and Ash values by instance for Class 1, Class 2, and Class 3]
Diabetes Dataset




Data is based on a population of women who were at
least 21 years old of Pima Indian heritage and living near
Phoenix in 1990
768 instances
9 attributes (Pregnancies, PG Concentration, Diastolic BP,
Tri Fold Thick, Serum Ins, BMI, DP Function, Age,
Diabetes)
Dataset has many missing values, only 532 instances are
complete
Source: UCI Machine Learning
Repository
Diabetes Dataset Analysis
[Figure: PG Concentration and Diastolic BP values by instance for Healthy and Sick]
Classification


Classification builds a model using a
training dataset with known classes of
data
That model is used to classify new,
unknown data into those classes
Classification Techniques

K-Nearest Neighbors

Decision Tree Classification (ID3, C4.5)
K-Nearest Neighbors Example
[Figure: an unknown point X plotted among labeled points of classes A and B]
• Simple to implement
• Easy to explain
• Sensitive to the selection of the
classification population
• Not always conclusive for complex
data
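The idea can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the results in this talk: the function name knn_classify, the choice of Euclidean distance, k = 3, and the sample points are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; features are tuples.
    Euclidean distance is an assumption; other metrics are possible.
    """
    # Order the training points by distance to the unknown point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Count the labels of the k closest points and return the winner.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points for classes A and B (not taken from the slides).
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.3), "A"),
    ((3.0, 3.2), "B"), ((3.1, 2.9), "B"), ((2.8, 3.0), "B"),
]
x = (2.7, 2.6)  # the unknown point X
print(knn_classify(train, x, k=3))
```

The result depends directly on which labeled points are available and on k, which is the sensitivity noted in the bullets above.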
K-Nearest Neighbors Example
MISCLASSIFICATION PERCENTAGES

Iris Dataset   All Attributes   Petal Length and Petal Width
Setosa         0/150 = 0%       0/150 = 0%
Versicolor     0/150 = 0%       0/150 = 0%
Virginica      9/150 = 6%       7/150 = 4.67%
Total          6%               4.67%

Wine Dataset   All Attributes   Phenols, Flavanoids, OD280/OD315
Class 1        0/153 = 0%       2/153 = 1.31%
Class 2        9/153 = 5.88%    30/153 = 19.61%
Class 3        0/153 = 0%       0/153 = 0%
Total          5.88%            20.92%

Source: Indelicato
Decision Tree Example (C4.5)



C4.5 is a decision tree generating algorithm, based on the ID3
algorithm. It contains several improvements, especially needed for
software implementation.
Choice of best splitting attribute is based on an entropy calculation.
These improvements include:
• Choosing an appropriate attribute selection measure.
• Handling training data with missing attribute values.
• Handling attributes with differing costs.
• Handling continuous attributes.
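The entropy calculation can be sketched as follows. This is a simplified illustration for categorical attributes only: information gain is the measure used by ID3, and gain ratio is the normalized form usually associated with C4.5. The function names and the toy rows are assumptions for the example, and missing values, costs, and continuous attributes are not handled here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on `attribute` (ID3 measure)."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Weighted entropy of the subsets produced by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(rows, labels, attribute):
    """Information gain divided by the entropy of the split itself."""
    split_info = entropy([row[attribute] for row in rows])
    if split_info == 0:
        return 0.0  # the attribute has a single value, so it cannot split
    return information_gain(rows, labels, attribute) / split_info

# Hypothetical training rows: "color" predicts the class, "size" does not.
rows = [
    {"color": "red", "size": "small"},
    {"color": "red", "size": "large"},
    {"color": "blue", "size": "small"},
    {"color": "blue", "size": "large"},
]
labels = ["yes", "yes", "no", "no"]
for attribute in ("color", "size"):
    print(attribute, information_gain(rows, labels, attribute),
          gain_ratio(rows, labels, attribute))
```

The attribute with the highest score would be chosen as the split at that node, and the procedure repeats on each subset.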
Decision Tree Example (C4.5)
Iris dataset
Wine dataset
Accuracy 97.67%
Accuracy 86.7%
Source: Seidler
Decision Tree Example (C4.5)
Diabetes dataset

C4.5 produces a complex tree (195 nodes)

The simplified (pruned) tree reduces the classification accuracy
            Before Pruning   After Pruning
Size        195              69
Errors      40 (5.2%)        102 (13.3%)
Accuracy    94.8%            86.7%
Association Rules
Association rules are used to show the
relationships between data items.
Purchasing one product when another product is
purchased is an example of an association rule.
They do not represent any causality or correlation.
Association Rule Techniques

Market Basket Analysis

Terminology

Transaction database

Association rule – implication {A, B} ═> {C}

Support - % of transactions in which {A, B, C} occurs

Confidence – ratio of the number of transactions that contain
{A, B, C} to the number of transactions that contain {A, B}
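The two measures follow directly from these definitions. The sketch below is illustrative only: the function names and the five sample transactions are assumptions, not data from this talk.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by the
    support of the antecedent alone (assumes the antecedent occurs)."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical transaction database.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C"},
    {"B", "C"},
    {"A", "C"},
]
# Rule {A, B} => {C}
print(support(transactions, {"A", "B", "C"}))    # 2 of 5 transactions
print(confidence(transactions, {"A", "B"}, {"C"}))  # 2 of the 3 with {A, B}
```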
Association Rule Example
1984 United States Congressional Voting Records Database
Attribute Information:
1. Class Name: 2 (democrat, republican)
2. handicapped-infants: 2 (y,n)
3. water-project-cost-sharing: 2 (y,n)
4. adoption-of-the-budget-resolution: 2 (y,n)
5. physician-fee-freeze: 2 (y,n)
6. El-Salvador-aid: 2 (y,n)
7. religious-groups-in-schools: 2 (y,n)
8. anti-satellite-test-ban: 2 (y,n)
9. aid-to-Nicaraguan-contras: 2 (y,n)
10. MX-missile: 2 (y,n)
11. immigration: 2 (y,n)
12. synfuels-corporation-cutback: 2 (y,n)
13. education-spending: 2 (y,n)
14. superfund-right-to-sue: 2 (y,n)
15. crime: 2 (y,n)
16. duty-free-exports: 2 (y,n)
17. export-administration-act-south-africa: 2 (y,n)
Rules:
{budget resolution = no, MX-missile = no,
aid to El Salvador = yes} ═> {Republican}
confidence 91.0%
{budget resolution = yes, MX-missile = yes,
aid to El Salvador = no} ═> {Democrat}
confidence 97.5%
{crime = yes, right-to-sue = yes,
Physician fee freeze = yes} ═> {Republican}
confidence 93.5%
{crime = no, right-to-sue = no,
Physician fee freeze = no} ═> {Democrat}
confidence 100.0%
Source: UCI Machine Learning
Repository
Clustering
Clustering is similar to classification in that data
are grouped.
Unlike classification, the groups are not
predefined; they are discovered.
Grouping is accomplished by finding similarities
between data according to characteristics found
in the actual data.
Clustering Techniques

K-Means Clustering

Neural Network Clustering (SOM)
K-Means Example



The K-Means algorithm is a method to
cluster objects based on their attributes
into k partitions.
It assumes that the k clusters exhibit
normal distributions.
The objective it tries to achieve is to
minimize the variance within the clusters.
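A minimal sketch of the usual iteration (assign each point to its nearest mean, then recompute the means) is shown below. The function name kmeans, the random choice of starting means, the fixed seed, and the sample points are assumptions for illustration; this is not the program used for the results in this talk.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Cluster `points` (tuples of numbers) into k groups.

    Starting means are k randomly chosen points (an assumption; other
    initialisation schemes exist and can give different results).
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each mean moves to the average of its members.
        new_centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return centers, clusters

# Hypothetical one-attribute data with two obvious groups.
points = [(1.0,), (1.2,), (0.8,), (5.0,), (5.2,), (4.8,)]
centers, clusters = kmeans(points, k=2)
print(centers)
print(clusters)
```

Each pass can only lower (or leave unchanged) the total squared distance from points to their cluster means, which is the within-cluster variance objective stated above.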
K-Means Example
[Figure: a dataset partitioned into Cluster 1, Cluster 2, and Cluster 3 around Mean 1, Mean 2, and Mean 3]
K-Means Example
Iris dataset, only the petal length attribute, Accuracy 95.33%

Cluster     Members                       Cluster mean
Cluster 1   46 Versicolor, 3 Virginica    4.22857
Cluster 2   4 Versicolor, 47 Virginica    5.55686
Cluster 3   50 Setosa                     1.46275

Iris dataset, all attributes, Accuracy 66.0%

Cluster     Members                       Mean
Cluster 1   47 Versicolor, 49 Virginica   6.30, 2.89, 4.96, 1.70
Cluster 2   21 Setosa, 1 Virginica        4.59, 3.07, 1.44, 0.29
Cluster 3   29 Setosa, 3 Versicolor       5.21, 3.53, 1.67, 0.35

Iris dataset, all attributes, Accuracy 90.67%

Cluster     Members
Cluster 1   23 Virginica
Cluster 2   1 Virginica
Cluster 3   26 Setosa
Cluster 4   12 Virginica
Cluster 5   24 Versicolor, 1 Virginica
Cluster 6   26 Versicolor, 13 Virginica
Cluster 7   24 Setosa
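The accuracy figures for the three-cluster runs appear consistent with counting, in each cluster, the members of its majority species and dividing by the 150 instances. That reading is an inference from the numbers, not something stated in the source, and the function name purity is an assumption. A short check under that assumption:

```python
def purity(clusters):
    """Share of instances that belong to the majority class of their cluster.

    `clusters` is a list of dicts mapping a class name to its count.
    """
    total = sum(sum(c.values()) for c in clusters)
    majority = sum(max(c.values()) for c in clusters)
    return majority / total

# Cluster compositions copied from the two three-cluster Iris runs.
single_attribute = [
    {"Versicolor": 46, "Virginica": 3},
    {"Versicolor": 4, "Virginica": 47},
    {"Setosa": 50},
]
all_attributes = [
    {"Versicolor": 47, "Virginica": 49},
    {"Setosa": 21, "Virginica": 1},
    {"Setosa": 29, "Versicolor": 3},
]
print(round(100 * purity(single_attribute), 2))  # 143 of 150
print(round(100 * purity(all_attributes), 2))    # 99 of 150
```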
Self-Organizing Map Example

The Self-Organizing Map was first described by the Finnish professor
Teuvo Kohonen and is thus sometimes referred to as a Kohonen
map.

SOM is especially good for visualizing high-dimensional data.

SOM maps input vectors onto a two-dimensional grid of nodes.

Nodes that are close together have similar attribute values and
nodes that are far apart have different attribute values.
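A compact sketch of SOM training is given below. It follows the common textbook scheme (find the best matching node, then pull it and its grid neighbors toward the input), with assumptions throughout: linear decay of the learning rate and neighborhood radius, a Gaussian neighborhood, random initial weights, and the names train_som and best_matching_unit. It is not the software used for the maps in this talk.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def best_matching_unit(weights, x):
    """Grid coordinate of the node whose weight vector is closest to x."""
    return min(weights, key=lambda node: sq_dist(weights[node], x))

def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = max(rows, cols) / 2
    # One randomly initialised weight vector per grid node.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        decay = 1 - epoch / epochs          # shrinks from 1 toward 0
        lr = lr0 * decay
        radius = max(radius0 * decay, 0.5)  # never below half a grid step
        for x in data:
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_sq = sq_dist(node, bmu)  # distance on the grid
                if grid_sq <= radius ** 2:
                    # Nodes nearer the winner on the grid move more.
                    influence = math.exp(-grid_sq / (2 * radius ** 2))
                    for i in range(dim):
                        w[i] += lr * influence * (x[i] - w[i])
    return weights

# Hypothetical 3-attribute inputs scaled to [0, 1].
data = [(0.1, 0.2, 0.1), (0.9, 0.8, 0.9), (0.15, 0.1, 0.2), (0.85, 0.9, 0.8)]
weights = train_som(data, rows=3, cols=3, epochs=20)
for x in data:
    print(x, "maps to node", best_matching_unit(weights, x))
```

Because neighbors of the winning node are updated together, nearby nodes end up with similar weight vectors, which is the property described in the last bullet above.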
Self-Organizing Map Example
[Figure: input vectors in X, Y, Z space mapped onto a two-dimensional grid of nodes]
Self-Organizing Map Example
Iris Data
[Figure: SOM grid with each node labeled Setosa, Versicolor, or Virginica]
Self-Organizing Map Example
Wine Data
[Figure: SOM grid with each node labeled Class-1, Class-2, or Class-3]
Self-Organizing Map Example
Diabetes Data
[Figure: SOM grid with each node labeled Healthy or Sick]
NFL Quarterback Analysis



Data from 2005 for 42 NFL quarterbacks
Preprocessed data to normalize for a full
16 game regular season
Used SOM to cluster individuals based on
performance and descriptive data
Source: McKee
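The source does not say how the 16 game normalization was done. One plausible reading, shown here purely as an assumption, is to scale each season total by games played; the function name prorate and the sample numbers are hypothetical.

```python
def prorate(stat_total, games_played, season_games=16):
    """Scale a season total to a full season of `season_games` games.

    Assumes simple proportional scaling, which may differ from the
    preprocessing actually used in the study.
    """
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    return stat_total * season_games / games_played

# Hypothetical quarterback: 2000 passing yards in 8 games.
print(prorate(2000, 8))
```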
NFL Quarterback Analysis
The SOM Map
Source: McKee
NFL Quarterback Analysis
QB Passing Rating
Overall Clustering
Source: McKee
Data Mining Stories - Revisited

Credit card fraud detection

NSA telephone network analysis

Supply chain management
Social Issues of Data Mining

Impacts on personal privacy and confidentiality

Classification and clustering are similar to profiling

Association rules resemble logical implications

Data mining is an imperfect process subject to
interpretation
Conclusion

Why data mining?

Example data sets

Data mining methods

Example application of data mining

Social issues of data mining
What on earth would a man do with himself if
something did not stand in his way?
- H.G. Wells
I don’t think necessity is the mother of invention –
invention, in my opinion, arises directly from
idleness, probably also from laziness, to save
oneself trouble.
- Agatha Christie, from “An Autobiography, Pt III,
Growing Up”
References

Dunham, Margaret, Data Mining Introductory and Advanced Topics, Pearson Education, Inc., 2003
Fisher, R.A., The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics 7, pp. 179-188, 1936
Han, Jiawei, Data Mining: Concepts and Techniques, Elsevier Inc., 2006
Indelicato, Nicolas, Analysis of the K-Nearest Neighbors Algorithm, MATH 4500: Foundations of Data Mining, 2004
McKee, Kevin, The Self Organized Map Applied to 2005 NFL Quarterbacks, MATH 4200: Data Mining Foundations, 2006
Newman, D.J. & Hettich, S. & Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science
Seidler, Toby, The C4.5 Project: An Overview of the Algorithm with Results of Experimentation, MATH 4500: Foundations of Data Mining, 2004