Classification
Slide 1
Supervised vs. Unsupervised Learning

Supervised learning (classification)
• Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
• New data is classified based on the training set

Unsupervised learning (clustering)
• The class labels of the training data are unknown
• Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Slide 2
Prediction Problems: Classification vs. Numeric Prediction

Classification
• predicts categorical class labels (discrete or nominal)
• constructs a model from the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data
Numeric Prediction
• models continuous-valued functions, i.e., predicts
unknown or missing values
Typical applications
• Credit/loan approval: whether to approve an application
• Medical diagnosis: whether a tumor is cancerous or benign
• Fraud detection: whether a transaction is fraudulent
• Web page categorization: which category a page belongs to
Slide 3
Classification—A Two-Step Process



Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects
• Estimate the accuracy of the model
• The known label of each test sample is compared with the model’s prediction
• The accuracy rate is the percentage of test set samples that are correctly classified by the model
• The test set is independent of the training set (otherwise overfitting results)
• If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called a validation (test) set
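As a concrete sketch of the two steps, here is a minimal Python example (scikit-learn and its iris data are assumptions for illustration; the exercises later in this deck use Weka instead): a model is constructed on a training set, then its accuracy is estimated on an independent test set.

```python
# Step 1: model construction on a training set; Step 2: model usage,
# with accuracy estimated on an independent test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # construction

# Usage: the known labels of the test samples are compared with the
# model's predictions; evaluating on the training set would hide overfitting.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```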
Slide 4
Process (1): Model Construction

Training data:

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

The classification algorithm produces the classifier (model):

IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Slide 5
Process (2): Using the Model in Prediction

Testing data:

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4) → Tenured?
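The classifier from Slide 4 is simple enough to write down directly; this sketch in plain Python (names and encodings are illustrative) checks it against the test set above and then classifies the unseen tuple:

```python
# The rule learned on Slide 4, applied to the test data on this slide.
def tenured(rank, years):
    return "yes" if rank == "professor" or years > 6 else "no"

test_set = [  # (name, rank, years, actual label)
    ("Tom", "assistant prof", 2, "no"),
    ("Merlisa", "associate prof", 7, "no"),
    ("George", "professor", 5, "yes"),
    ("Joseph", "assistant prof", 7, "yes"),
]
correct = sum(tenured(r, y) == label for _, r, y, label in test_set)
print(f"accuracy: {correct}/{len(test_set)}")  # 3/4 -- Merlisa is misclassified

print("Jeff:", tenured("professor", 4))        # unseen data -> 'yes'
```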
Slide 6
Decision Tree Induction: An Example
 Training data set: Buys_computer
 The data set follows an example from Quinlan’s ID3
 Resulting tree:

age?
  <=30:    student?
             no:  no
             yes: yes
  31..40:  yes
  >40:     credit_rating?
             excellent: no
             fair:      yes

The training data:

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
Slide 7
Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down, recursive, divide-and-conquer manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are
discretized in advance)
• Examples are partitioned recursively based on selected
attributes
• Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
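A compact sketch of this greedy procedure (an ID3-style illustration in Python, assuming categorical attributes and information gain as the heuristic; not the code of any particular system):

```python
# Top-down, recursive, divide-and-conquer decision tree induction.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy before the split minus weighted entropy after it."""
    n = len(labels)
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[attr], []).append(label)
    after = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - after

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:            # stop: node is pure
        return labels[0]
    if not attrs:                        # stop: no attributes left -> majority
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    branches = {}
    for value in {row[best] for row in rows}:   # partition recursively
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = build_tree([rows[i] for i in keep],
                                     [labels[i] for i in keep],
                                     [a for a in attrs if a != best])
    return (best, branches)
```

Run on the Buys_computer table from Slide 7 (rows encoded as dicts), this should select age at the root, matching the tree shown there.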
Slide 8
MBA Application
Classification Example
Slide 9
Plan

Discussion on classification using a data mining tool
(It’s suggested to work with your final project teammates.)
Slide 10
Decisions on Admissions


Download the data mining tool (Weka.zip) and the admission dataset ABC.csv from the Course Page.
Work as a team to make admission decisions for the MBA program at ABC University. Remember to save your file.
• Assuming that each record (row) represents a student, use the provided information (GPA, GMAT, years of work experience) to make admission decisions.
• If you decide to admit the student, fill in Yes in the Admission column. Otherwise, fill in No to reject the application.
Slide 11
Decisions on Admissions (cont.)
Before using Weka, write down your criteria for judging the applications.
Slide 12
Decision Tree





A machine learning (data mining) technique.
Makes decisions.
Generates interpretable rules.
Rules can be reused for future predictions.
Gives automatic and faster responses (decisions).
Slide 13
Using Weka



Extract Weka.zip to the desktop
Run “RunWeka3-4-6.bat” from the Weka folder on your desktop
Open ABC.csv (with your decisions filled in) from Weka
• Set the file type to CSV
Slide 14
Using Weka (cont.)





Click the Classify tab
Choose the J48 classifier (under trees)
Set the Test options to Use training set
Enable Output predictions under More options
Click Start to run
Slide 15
Using Weka (cont.)

Read the classification output report in the Classifier output pane
Slide 16
Using Weka (cont.)

Use the decision tree generated by the tool (the “J48 pruned tree” section of the output) for your write-up
Slide 17
Using Weka (cont.)

The tree generated for my example is:

GMAT <= 390: No (9.0)
GMAT > 390
|   Exp <= 2
|   |   GMAT <= 450: No (2.0)
|   |   GMAT > 450: Yes (6.0/1.0)
|   Exp > 2: Yes (13.0)

(Drawn as a flowchart: GMAT > 390? No → Rejected. Yes → Exp > 2? Yes → Admitted. No → GMAT > 450? Yes → Admitted, No → Rejected.)
Slide 18
Using Weka (cont.)

 Performance of the decision tree: see the Weka classifier output; one incorrect prediction is flagged
Slide 19
Using Weka (cont.)
Accuracy rate for decision tree prediction
= (number of correct predictions) / (total observations)
= 29 / 30
= 96.6667%
Slide 20
Using Weka (cont.)
Share your criteria (results) with other teams.
Slide 21
Simplicity first!
Simple algorithms often work very well!
 There are many kinds of simple structure, e.g.:
– One attribute does all the work
– Attributes contribute equally and independently
– A decision tree that tests a few attributes
– Calculate distance from training instances
– Result depends on a linear combination of attributes
 Success of a method depends on the domain
– Data mining is an experimental science
Slide 22
Simplicity first!
OneR: One attribute does all the work
 Learn a 1-level “decision tree”
– i.e., rules that all test one particular attribute
 Basic version
– One branch for each value
– Each branch assigns the most frequent class
– Error rate: proportion of instances that don’t belong to the majority class of their corresponding branch
– Choose the attribute with the smallest error rate
Slide 23
Simplicity first!
For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of this attribute's rules
Choose the attribute with the smallest error rate
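The pseudocode translates almost line for line into Python; a sketch (the helper names are my own) that returns the best attribute together with its rules and error count:

```python
from collections import Counter

def one_r(rows, labels):
    """rows: list of {attribute: value} dicts; labels: class labels."""
    best = None
    for attr in rows[0]:
        counts = {}                       # attribute value -> class counts
        for row, label in zip(rows, labels):
            counts.setdefault(row[attr], Counter())[label] += 1
        # Each branch assigns the most frequent class for its value.
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Error rate: instances not in the majority class of their branch.
        errors = sum(sum(c.values()) - c[rules[v]] for v, c in counts.items())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best                           # (attribute, {value: class}, errors)
```

On the weather data shown on the next slide, this picks outlook or humidity, which tie at 4/14 errors.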
Slide 24
Simplicity first!
The weather data:

Outlook    Temp   Humidity   Wind    Play
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
Rainy      Mild   High       False   Yes
Rainy      Cool   Normal     False   Yes
Rainy      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Sunny      Mild   High       False   No
Sunny      Cool   Normal     False   Yes
Rainy      Mild   Normal     False   Yes
Sunny      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Rainy      Mild   High       True    No

OneR’s candidate rules:

Attribute   Rules             Errors   Total errors
Outlook     Sunny → No        2/5      4/14
            Overcast → Yes    0/4
            Rainy → Yes       2/5
Temp        Hot → No*         2/4      5/14
            Mild → Yes        2/6
            Cool → Yes        1/4
Humidity    High → No         3/7      4/14
            Normal → Yes      1/7
Wind        False → Yes       2/8      5/14
            True → No*        3/6

* indicates a tie
Slide 25
Simplicity first!
Use OneR
 Open file weather.nominal.arff
 Choose OneR rule learner (rules>OneR)
 Look at the rule
Slide 26
Simplicity first!
OneR: One attribute does all the work
 Incredibly simple method, described in 1993
“Very Simple Classification Rules Perform Well on Most Commonly Used Datasets”
– Experimental evaluation on 16 datasets
– Used cross‐validation
– Simple rules often outperformed far more complex methods
 How can it work so well?
– some datasets really are simple
– some are so small/noisy/complex that nothing can be
learned from them!
Rob Holte,
Alberta, Canada
Slide 27
Overfitting
 Any machine learning method may “overfit” the training data …
… by producing a classifier that fits the training data too tightly
 Works well on training data but not on independent test data
 Remember the “User classifier”? Imagine tediously putting a tiny circle
around every single training data point
 Overfitting is a general problem
 … we illustrate it with OneR
Slide 28
Overfitting
Numeric attributes:

Outlook    Temp   Humidity   Wind    Play
Sunny      85     85         False   No
Sunny      80     90         True    No
Overcast   83     86         False   Yes
Rainy      75     80         False   Yes
…          …      …          …       …

OneR builds one branch per distinct numeric value:

Attribute   Rules       Errors   Total errors
Temp        85 → No     0/1      0/14
            80 → Yes    0/1
            83 → Yes    0/1
            75 → Yes    0/1
            …           …
 OneR has a parameter that limits the complexity of such rules
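One simple realization of such a limit, as a hedged sketch (in the spirit of Weka's minBucketSize, not its exact algorithm): sort the instances by value and close a bucket only once it holds a minimum number of instances, so no branch can memorize a single instance.

```python
# Sketch: discretize a numeric attribute for OneR with a minimum bucket
# size, preventing one-instance rules like those in the table above.
def numeric_buckets(values, labels, min_bucket_size=6):
    pairs = sorted(zip(values, labels))   # sort instances by attribute value
    buckets, current = [], []
    for value, label in pairs:
        current.append((value, label))
        if len(current) >= min_bucket_size:
            buckets.append(current)       # bucket is large enough: close it
            current = []
    if current:                           # fold leftovers into the last bucket
        if buckets:
            buckets[-1].extend(current)
        else:
            buckets.append(current)
    return buckets                        # each bucket -> one branch of the rule
```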
Slide 29
Overfitting
Experiment with OneR
 Open file weather.numeric.arff
 Choose OneR rule learner (rules>OneR)
 Resulting rule is based on the outlook attribute, so remove outlook
 Now the rule is based on the humidity attribute:

humidity:
  < 82.5  → yes
  >= 82.5 → no
(10/14 instances correct)
Slide 30
Overfitting
Experiment with the diabetes dataset
 Open file diabetes.arff
 Choose ZeroR rule learner (rules>ZeroR); cross-validation: 65.1%
 Choose OneR rule learner (rules>OneR); cross-validation: 72.1%
 Look at the rule (plas = plasma glucose concentration)
 Change the minBucketSize parameter to 1; cross-validation: 54.9%
 Evaluate on the training set: 86.6%
 Look at the rule again
Slide 31
Overfitting
 Overfitting is a general phenomenon that damages all ML methods
 One reason why you must never evaluate on the training set
 Overfitting can occur more generally
 E.g., try many ML methods and choose the best for your data
– you cannot expect to get the same performance on new test data
 Divide data into training, test, validation sets?
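One standard answer to that last question, as a minimal sketch (scikit-learn and its iris data are assumptions for illustration):

```python
# Three-way split: choose among models on the validation set; report
# final performance once, on the untouched test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)
# Result: 60% training / 20% validation / 20% test
```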
Slide 32
Using probabilities
(Recall OneR: one attribute does all the work)
Opposite strategy: use all the attributes
The “Naïve Bayes” method
 Two assumptions: attributes are
– equally important a priori
– statistically independent (given the class value)
i.e., knowing the value of one attribute says nothing about the value of another (if the class is known)
 Independence assumption is never correct!
 But … often works well in practice
Slide 33
Using probabilities
Probability of event H (the class) given evidence E (the instance):

Pr[H | E] = Pr[E | H] × Pr[H] / Pr[E]

 Pr[H] is the a priori probability of H
– probability of the event before evidence is seen
 Pr[H | E] is the a posteriori probability of H
– probability of the event after evidence is seen
 “Naïve” assumption:
– evidence splits into parts that are independent

Pr[H | E] = Pr[E1 | H] × Pr[E2 | H] × … × Pr[En | H] × Pr[H] / Pr[E]
Thomas Bayes, British mathematician, 1702 –1761
Slide 34
Using probabilities
Counts and relative frequencies from the weather data (the 14-instance table on Slide 24):

Outlook    Yes   No        Temperature   Yes   No
Sunny      2     3         Hot           2     2
Overcast   4     0         Mild          4     2
Rainy      3     2         Cool          3     1
Sunny      2/9   3/5       Hot           2/9   2/5
Overcast   4/9   0/5       Mild          4/9   2/5
Rainy      3/9   2/5       Cool          3/9   1/5

Humidity   Yes   No        Wind    Yes   No        Play   Yes    No
High       3     4         False   6     2                9      5
Normal     6     1         True    3     3                9/14   5/14
High       3/9   4/5       False   6/9   2/5
Normal     6/9   1/5       True    3/9   3/5

Pr[H | E] = Pr[E1 | H] × Pr[E2 | H] × … × Pr[En | H] × Pr[H] / Pr[E]
Slide 35
Using probabilities
(Count and frequency tables as on the previous slide.)

A new day:

Outlook   Temp.   Humidity   Wind   Play
Sunny     Cool    High       True   ?

Likelihood of the two classes:
For “yes” = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For “no” = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

Conversion into a probability by normalization:
P(“yes”) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(“no”) = 0.0206 / (0.0053 + 0.0206) = 0.795
Slide 36
Using probabilities
A new day (evidence E):

Outlook   Temp.   Humidity   Wind   Play
Sunny     Cool    High       True   ?

Probability of class “yes”:

Pr[yes | E] = Pr[Outlook = Sunny | yes]
            × Pr[Temperature = Cool | yes]
            × Pr[Humidity = High | yes]
            × Pr[Windy = True | yes]
            × Pr[yes] / Pr[E]
            = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
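The arithmetic is easy to check in a few lines of Python (a sketch; the fractions are read directly off the count tables on Slide 34):

```python
# Likelihoods for the new day (sunny, cool, high humidity, windy).
yes_like = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # = 0.0053
no_like  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # = 0.0206

total = yes_like + no_like        # normalization stands in for Pr[E]
print(f"P(yes|E) = {yes_like / total:.3f}")   # 0.205
print(f"P(no|E)  = {no_like / total:.3f}")    # 0.795
```

The zero-frequency fix mentioned on the next slide amounts to adding 1 to every count before forming these fractions (Laplace smoothing), so a zero count such as Overcast under “no” never zeroes out a whole product.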
Slide 37
Using probabilities
Use Naïve Bayes




Open file weather.nominal.arff
Choose Naïve Bayes method (bayes>NaiveBayes)
Look at the classifier
Avoid zero frequencies: start all counts at 1
Slide 38
Using probabilities
 “Naïve Bayes”: all attributes contribute equally and independently
 Works surprisingly well
– even if independence assumption is clearly violated
 Why?
– classification doesn’t need accurate probability estimates
so long as the greatest probability is assigned to the correct class
 Adding redundant attributes causes problems (e.g. identical attributes) → remedy: attribute selection
Slide 39
Decision Trees
Top-down: recursive divide-and-conquer
 Select attribute for root node
– Create a branch for each possible attribute value
 Split instances into subsets
– One for each branch extending from the node
 Repeat recursively for each branch
– using only instances that reach the branch
 Stop
– if all instances have the same class
Slide 40
Decision Trees
Which attribute to select?
Slide 41
Decision Trees
Which is the best attribute?
 Aim: to get the smallest tree
 Heuristic
– choose the attribute that produces the “purest” nodes
– i.e., the one with the greatest information gain
 Information theory: measure information in bits

entropy(p1, p2, …, pn) = −p1 log p1 − p2 log p2 − … − pn log pn

Information gain
 Amount of information gained by knowing the value of the attribute
 = (entropy of distribution before the split) − (entropy of distribution after it)

Claude Shannon, American mathematician and scientist, 1916–2001
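A short sketch in Python (log base 2, i.e., bits) that computes entropy and information gain; on the weather data's outlook attribute it reproduces the 0.247 bits quoted on the next slide:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(values, labels):
    """(Entropy before the split) - (weighted entropy after it)."""
    n = len(labels)
    split = {}
    for v, label in zip(values, labels):
        split.setdefault(v, []).append(label)
    return entropy(labels) - sum(len(p) / n * entropy(p)
                                 for p in split.values())

# The 14 weather instances (Slide 24), in table order.
play = "no no yes yes yes no yes no yes yes yes yes yes no".split()
outlook = ("sunny sunny overcast rainy rainy rainy overcast "
           "sunny sunny rainy sunny overcast overcast rainy").split()
print(f"gain(outlook) = {gain(outlook, play):.3f} bits")   # 0.247
```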
Slide 42
Decision Trees
Which attribute to select? Information gain of each candidate split:
gain(outlook) = 0.247 bits
gain(humidity) = 0.152 bits
gain(windy) = 0.048 bits
gain(temperature) = 0.029 bits
Slide 43
Decision Trees
Continue to split …
gain(temperature) = 0.571 bits
gain(windy) = 0.020 bits
gain(humidity) = 0.971 bits
Slide 44
Decision Trees
Use J48 on the weather data




Open file weather.nominal.arff
Choose J48 decision tree learner (trees>J48)
Look at the tree
Use right‐click menu to visualize the tree
Slide 45
Decision Trees
 J48: “top‐down induction of decision trees”
 Soundly based in information theory
 Produces a tree that people can understand
 Many different criteria for attribute selection
– rarely make a large difference
 Needs further modification to be useful in practice
Slide 46
Pruning Decision Trees
Slide 47
Pruning Decision Trees
Highly branching attributes — Extreme case: ID code
Slide 48
Pruning Decision Trees
How to prune?
 Don’t continue splitting if the nodes get very small
(J48 minNumObj parameter, default value 2)
 Build full tree and then work back from the leaves, applying a
statistical test at each stage
(confidenceFactor parameter, default value 0.25)
 Sometimes it’s good to prune an interior node, raising the
subtree beneath it up one level
(subtreeRaising, default true)
 Messy … complicated … not particularly illuminating
Slide 49
Pruning Decision Trees
Over-fitting (again!)
Sometimes simplifying a decision tree gives better results
 Open file diabetes.arff
 Choose J48 decision tree learner (trees>J48)
 Prunes by default: 73.8% accuracy; tree has 20 leaves, 39 nodes
 Turn off pruning: 72.7%; 22 leaves, 43 nodes
 Extreme example: breast-cancer.arff
– Default (pruned): 75.5% accuracy; tree has 4 leaves, 6 nodes
– Unpruned: 69.6%; 152 leaves, 179 nodes
Slide 50
Pruning Decision Trees
 C4.5/J48 is a popular early machine learning method
 Many different pruning methods
– mainly change the size of the pruned tree
 Pruning is a general technique that can apply to
structures other than trees (e.g. decision rules)
 Univariate vs. multivariate decision trees
– Single vs. compound tests at the nodes
 From C4.5 to J48
Ross Quinlan,
Australian computer scientist
Slide 51
Nearest neighbor
“Rote learning”: simplest form of learning
 To classify a new instance, search the training set for the one that’s “most like” it
– the instances themselves represent the “knowledge”
– lazy learning: do nothing until you have to make predictions
 “Instance-based” learning = “nearest-neighbor” learning
Slide 52
Nearest neighbor
Slide 53
Nearest neighbor
Search the training set for the instance that’s “most like” the new one
 Need a similarity function
– Regular (“Euclidean”) distance? (sum of squares of differences)
– Manhattan (“city-block”) distance? (sum of absolute differences)
– Nominal attributes? Distance = 1 if different, 0 if same
– Normalize the attributes to lie between 0 and 1? (see the sketch below)
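Combining these choices, a mixed-attribute distance function might look like this sketch (the function name and dict-based instance encoding are assumptions for illustration):

```python
def distance(a, b, ranges):
    """a, b: instances as {attribute: value} dicts;
    ranges: {attribute: (lo, hi)} for the numeric attributes."""
    total = 0.0
    for attr in a:
        if attr in ranges:                    # numeric: normalize to [0, 1]
            lo, hi = ranges[attr]
            d = (a[attr] - b[attr]) / (hi - lo)
        else:                                 # nominal: 1 if different, 0 if same
            d = 0.0 if a[attr] == b[attr] else 1.0
        total += d * d                        # Euclidean: sum of squares ...
    return total ** 0.5                       # ... then square root
# For Manhattan distance, sum abs(d) instead and skip the square root.
```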
Slide 54
Nearest neighbor
What about noisy instances?
 Nearest-neighbor: use the single nearest instance
 k-nearest-neighbors
– choose the majority class among several neighbors (k of them)
 In Weka: lazy>IBk (instance-based learning)
Slide 55
Nearest neighbor
Investigate effect of changing k
 Glass dataset
 lazy > IBk, k = 1, 5, 20
 10‐fold cross‐validation
k = 1     70.6%
k = 5     67.8%
k = 20    65.4%
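A comparable experiment takes a few lines of Python (scikit-learn's KNeighborsClassifier as an assumed stand-in for Weka's lazy>IBk; the iris data substitutes for the glass dataset, so the exact percentages will differ):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 5, 20):
    # 10-fold cross-validated accuracy for each choice of k.
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                          X, y, cv=10).mean()
    print(f"k = {k:2d}: {acc:.1%}")   # accuracy typically drops as k grows large
```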
Slide 56
Nearest neighbor
 Often very accurate … but slow:
– scans the entire training data to make each prediction
– sophisticated data structures can make this faster
 Assumes all attributes are equally important
– Remedy: attribute selection or attribute weights
 Remedies against noisy instances:
– Majority vote over the k nearest neighbors
– Weight instances according to prediction accuracy
– Identify reliable “prototypes” for each class
 Statisticians have used k-NN since the 1950s
– If training set size n → ∞ and k → ∞ with k/n → 0, the error approaches the minimum
Slide 57