PowerPoint Presentation

Chapter 4
Basic Data Mining Technique
Content
• What is classification?
• What is prediction?
• Supervised and Unsupervised Learning
• Decision trees
• Association rule
• K-nearest neighbor classifier
• Case-based reasoning
• Genetic algorithm
• Rough set approach
• Fuzzy set approaches
Data Warehouse and Data Mining
Data Mining Process
Data Mining Strategies
Classification vs. Prediction
• Classification:
– predicts categorical class labels
– classifies data (constructs a model) based on
the training set and the values (class labels) in
a classifying attribute, and uses the model to
classify new data
• Prediction:
– models continuous-valued functions, i.e.,
predicts unknown or missing values
Classification vs. Prediction
• Typical Applications
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis
Classification Process
1. Model construction:
2. Model usage:
Classification Process
1. Model construction:
describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a
predefined class, as determined by the class label
attribute
• The set of tuples used for model construction:
training set
• The model is represented as classification rules,
decision trees, or mathematical formulae
1. Model Construction
Training Data → Classification Algorithms → Classifier (Model)

Training Data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (Model):
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
Classification Process
2. Model usage:
for classifying future or unknown objects
Estimate accuracy of the model
• The known label of each test sample is compared with
the classified result from the model
• Accuracy rate is the percentage of test set
samples that are correctly classified by the model
• Test set is independent of training set
2. Use the Model in Prediction
Testing Data → Classifier

Testing Data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data: (Jeff, Professor, 4)
Tenured?
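The two steps of the classification process can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own implementation: the function names `predict_tenured` and `accuracy` are assumptions, while the rule and the testing data are taken from the tenure example.

```python
# Rule from the model-construction step:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank: str, years: int) -> str:
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Testing data from the tenure example: (name, rank, years, known label)
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def accuracy(samples) -> float:
    """Fraction of samples whose known label matches the model's output."""
    correct = sum(
        predict_tenured(rank, years) == label
        for _, rank, years, label in samples
    )
    return correct / len(samples)

print(accuracy(test_set))               # 0.75 (Merlisa is misclassified)
print(predict_tenured("Professor", 4))  # yes (the unseen sample Jeff)
```

The accuracy rate is computed on the testing data only, which keeps the estimate independent of the tuples used to build the rule.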
What Is Prediction?
• Prediction is similar to classification
– 1. Construct a model
– 2. Use model to predict unknown value
• Major method for prediction is regression
– Linear and multiple regression
– Non-linear regression
• Prediction is different from classification
– Classification predicts categorical class labels
– Prediction models continuous-valued functions
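As a small illustration of prediction by linear regression, the sketch below fits a straight line by least squares and uses it to predict an unknown value. The data points and the name `fit_line` are hypothetical and chosen only for this example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b

# Hypothetical training data lying on the line y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

# Step 1: construct the model. Step 2: use it to predict an unknown value.
a, b = fit_line(xs, ys)
predicted = a + b * 5.0
print(a, b, predicted)
```

Unlike the classifier, the output here is a continuous value rather than a class label.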
Issues regarding classification and prediction
1. Data Preparation
2. Evaluating Classification Methods
1. Data Preparation
• Data cleaning
– Preprocess data in order to reduce noise and handle
missing values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Generalize and/or normalize data
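One common way to normalize a numeric attribute is min-max scaling. The slide does not name a specific method, so the sketch below is an assumed example; the function name and the income values are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map everything to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

# Hypothetical income values
incomes = [20_000, 35_000, 50_000, 80_000]
normalized = min_max_normalize(incomes)
print(normalized)
```

Scaling attributes to a common range matters most for distance-based methods, where an attribute with large raw values would otherwise dominate.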
2. Evaluating Classification Methods
• Predictive accuracy
• Speed
– time to construct the model
– time to use the model
• Robustness
– handling noise and missing values
• Scalability
– efficiency in disk-resident databases
• Interpretability:
– understanding and insight provided by the model
• Goodness of rules
– decision tree size
– compactness of classification rules
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the
aim is to establish the existence of classes or
clusters in the data
Supervised Learning
Unsupervised Learning
Classification by Decision Tree Induction
• Decision tree
– A flow-chart-like tree structure
– Internal node denotes a test on an attribute
– Branch represents an outcome of the test
– Leaf nodes represent class labels or class
distribution
• Use of decision tree: Classifying an unknown sample
– Test the attribute values of the sample against the
decision tree
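Classifying a sample with a decision tree amounts to walking from the root to a leaf. The sketch below assumes a nested-dict representation and a small tree over the buys_computer attributes; the layout and the name `classify` are illustrative choices, not a prescribed format.

```python
# Internal node: {"attribute": name, "branches": {outcome: subtree}}
# Leaf node: a class label string
tree = {
    "attribute": "age",
    "branches": {
        "<=30": {
            "attribute": "student",
            "branches": {"no": "no", "yes": "yes"},
        },
        "31..40": "yes",
        ">40": {
            "attribute": "credit_rating",
            "branches": {"excellent": "no", "fair": "yes"},
        },
    },
}

def classify(node, sample):
    """Test the sample's attribute values against the tree until a leaf is reached."""
    while isinstance(node, dict):
        outcome = sample[node["attribute"]]
        node = node["branches"][outcome]
    return node

sample = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(classify(tree, sample))  # yes
```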
Classification by Decision Tree Induction
• Decision tree generation consists of two
phases
1. Tree construction
• At start, all the training examples are at
the root
• Partition examples recursively based on
selected attributes
2. Tree pruning
• Identify and remove branches that reflect
noise or outliers
Training Dataset
This follows an example from Quinlan’s ID3

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
Output: A Decision Tree for “buys_computer”
age?
  <=30: student?
    no: no
    yes: yes
  31…40: yes
  >40: credit rating?
    excellent: no
    fair: yes
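The slides do not state how the root attribute is selected. ID3 uses information gain, and the sketch below computes it for the 14 training tuples under that assumption. The names `entropy` and `info_gain` are illustrative, and the age range is written as "31..40" in plain ASCII.

```python
from collections import Counter
from math import log2

ATTRS = ["age", "income", "student", "credit_rating"]
# Training tuples; the last field is the class label buys_computer.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def entropy(labels):
    """Expected information (in bits) needed to identify the class of a tuple."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, attr_index):
    """Entropy of the whole set minus the weighted entropy after the split."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attr_index], []).append(row[-1])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy([row[-1] for row in rows]) - remainder

gains = {attr: info_gain(ROWS, i) for i, attr in enumerate(ATTRS)}
best = max(gains, key=gains.get)
print(best, round(gains[best], 3))  # age 0.247
```

Under this measure age has the largest gain, which is consistent with age being the test at the root of the tree; the same computation would then be repeated on each partition.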
Decision Tree
What Is Association Mining?
• Association rule mining:
– Finding frequent patterns, associations,
correlations, or causal structures among sets
of items or objects in transaction databases,
relational databases, and other information
repositories.
• Applications:
– Basket data analysis, cross-marketing, catalog
design, loss-leader analysis, clustering,
classification, etc.
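The slide does not define how a rule is scored. The sketch below assumes the two standard measures, support and confidence, and evaluates the rule bread → milk on a hypothetical basket database; all item names and transactions are invented for illustration.

```python
# Hypothetical transaction database (one set of items per basket)
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent, db) / support(antecedent, db)

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

Here 3 of 5 baskets contain both items and 4 of 5 contain bread, so the rule has support 0.6 and confidence of about 0.75.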
Presentation of Classification Results
Instance-Based Methods
• Instance-based learning:
– Store training examples and delay the
processing (“lazy evaluation”) until a new
instance must be classified
• Typical approaches
– k-nearest neighbor approach
• Instances represented as points in a
Euclidean space.
– Case-based reasoning
• Uses symbolic representations and
knowledge-based inference
The k-Nearest Neighbor Algorithm
• All instances correspond to points in the n-D space.
• The nearest neighbors are defined in terms of
Euclidean distance.
• The target function could be discrete- or real-valued.
• For a discrete-valued target function, k-NN returns the
most common value among the k training examples nearest
to xq.
• Voronoi diagram: the decision surface induced by
1-NN for a typical set of training examples.
[Figure: query point xq surrounded by training examples labeled + and −]
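For a discrete-valued target function, k-NN can be sketched as a sort by Euclidean distance followed by a majority vote. The training points below are hypothetical, and `knn_classify` is an illustrative name; `math.dist` requires Python 3.8 or later.

```python
from collections import Counter
from math import dist

def knn_classify(training, query, k=3):
    """Return the most common label among the k training examples nearest to query."""
    nearest = sorted(training, key=lambda example: dist(example[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training examples: (point in 2-D space, class label)
training = [
    ((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((2.0, 1.0), "+"),
    ((5.0, 5.0), "-"), ((6.0, 5.5), "-"), ((5.5, 6.5), "-"),
]

xq = (2.0, 2.0)
print(knn_classify(training, xq, k=3))  # +
```

No model is built in advance: all of the work happens when the query arrives, which is what makes the method lazy.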
Case-Based Reasoning
• Also uses lazy evaluation and analyzes similar
instances
• Difference: instances are not “points in a
Euclidean space”
• Methodology
– Instances represented by rich symbolic
descriptions (e.g., function graphs)
– Multiple retrieved cases may be combined
Genetic Algorithms
• GA: based on an analogy to biological evolution
• Each rule is represented by a string of bits
– e.g., IF A1 and Not A2 then C2 can be encoded as 100
• An initial population is created consisting of randomly
generated rules
• Based on the notion of survival of the fittest, a new
population is formed to consist of the fittest rules and
their offspring
• The fitness of a rule is represented by its classification
accuracy on a set of training examples
• Offspring are generated by crossover and mutation
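The bit-string encoding and the two genetic operators can be sketched as follows. The bit layout is an assumption chosen to match the example encoding 100: the two leftmost bits stand for A1 and A2, and the rightmost bit for the class (1 = C1, 0 = C2). The crossover point and mutation index are passed explicitly so the sketch stays deterministic; a real run would pick them at random.

```python
def encode_rule(a1: bool, a2: bool, is_c1: bool) -> str:
    """Assumed layout: bits for A1 and A2, then the class bit (1 = C1, 0 = C2)."""
    return "".join("1" if bit else "0" for bit in (a1, a2, is_c1))

def crossover(parent_a: str, parent_b: str, point: int):
    """Swap the substrings after `point` to produce two offspring."""
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(rule: str, index: int) -> str:
    """Invert the bit at `index`."""
    flipped = "1" if rule[index] == "0" else "0"
    return rule[:index] + flipped + rule[index + 1:]

rule_1 = encode_rule(True, False, False)   # IF A1 and Not A2 then C2
rule_2 = encode_rule(False, False, True)   # IF Not A1 and Not A2 then C1
children = crossover(rule_1, rule_2, point=1)
mutant = mutate(rule_1, index=2)
print(rule_1, rule_2, children, mutant)
```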
Supervised genetic learning
Rough Set Approach
• Rough sets are used to approximately or “roughly”
define equivalence classes
Rough Set Approach
• A rough set for a given class C is
approximated by two sets:
1. a lower approximation (certain to be in C) and
2. an upper approximation (cannot be described as
not belonging to C)
• Finding the minimal subsets of attributes
(for feature reduction) is NP-hard
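The two approximations can be computed by grouping objects that are indistinguishable on the chosen attributes. The sketch below is a minimal illustration; the objects, attribute values, and the name `approximations` are hypothetical.

```python
def approximations(objects, attributes, target):
    """Return (lower, upper) approximations of the set `target`.

    objects: dict mapping object name -> dict of attribute values
    target: set of object names that belong to class C
    """
    # Objects with identical values on `attributes` form one equivalence class.
    classes = {}
    for name, values in objects.items():
        key = tuple(values[a] for a in attributes)
        classes.setdefault(key, set()).add(name)

    lower, upper = set(), set()
    for members in classes.values():
        if members <= target:   # entirely inside C: certain members
            lower |= members
        if members & target:    # overlaps C: cannot be ruled out
            upper |= members
    return lower, upper

objects = {
    "o1": {"income": "high", "student": "no"},
    "o2": {"income": "high", "student": "no"},
    "o3": {"income": "low", "student": "yes"},
    "o4": {"income": "medium", "student": "yes"},
}
target = {"o1", "o3"}  # hypothetical class C

lower, upper = approximations(objects, ["income", "student"], target)
print(sorted(lower), sorted(upper))  # ['o3'] ['o1', 'o2', 'o3']
```

Because o1 and o2 share the same attribute values but only o1 is in C, neither is certain to be in C, yet neither can be excluded.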
Fuzzy Set Approaches
• Fuzzy logic uses truth values between 0.0 and
1.0 to represent the degree of membership
(for example, using a fuzzy membership graph)
[Figure: fuzzy membership graph. Vertical axis: fuzzy membership; horizontal axis: Income; labels: low, somewhat low, medium, baseline high, high]
Fuzzy Set Approaches
• Attribute values are converted to fuzzy values
– e.g., income is mapped into the discrete categories
{low, medium, high} with fuzzy values calculated
• For a given new sample, more than one fuzzy
value may apply
• Each applicable rule contributes a vote for
membership in the categories
• Typically, the truth values for each predicted
category are summed
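The steps above can be sketched as follows. The membership breakpoints (20,000 / 40,000 / 60,000), the rule table, and the category names "approve" and "reject" are all hypothetical; only the overall procedure (fuzzify, let each applicable rule vote, sum the truth values) follows the slide.

```python
def income_memberships(income):
    """Map an income to fuzzy values for low, medium, and high (assumed breakpoints)."""
    return {
        "low": max(0.0, min(1.0, (40_000 - income) / 20_000)),
        "medium": max(0.0, 1.0 - abs(income - 40_000) / 20_000),
        "high": max(0.0, min(1.0, (income - 40_000) / 20_000)),
    }

# Hypothetical rules: fuzzy income value -> predicted category
RULES = {"low": "reject", "medium": "approve", "high": "approve"}

def vote(income, rules):
    """Sum the truth values contributed by each rule to its predicted category."""
    totals = {}
    for fuzzy_value, truth in income_memberships(income).items():
        category = rules[fuzzy_value]
        totals[category] = totals.get(category, 0.0) + truth
    return totals

totals = vote(25_000, RULES)
print(totals)  # {'reject': 0.75, 'approve': 0.25}
```

An income of 25,000 is partly low and partly medium at the same time, so two rules contribute votes with different weights.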
Reference
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques
(Chapter 7 of the textbook), Intelligent Database Systems Research Lab,
School of Computing Science, Simon Fraser University, Canada