Classification UE 141 Spring 2013 Jing Gao

advertisement
Classification
UE 141 Spring 2013
Jing Gao
SUNY Buffalo
1
Classification
features
patient
class labels
temp. blood pres. heart rate Sick?
99
110
90
Yes
100
120
100
Yes
96
130
65
No
labeled
training
a model: f(x)=y: features  class labels
patient
temp. blood pres. heart rate Sick?
98
130
80
115
110
95
test
unlabeled
2
Illustrating Classification Task
Tid
Attrib1
1
Yes
Large
125K
No
2
No
Medium
100K
No
3
No
Small
70K
No
4
Yes
Medium
120K
No
5
No
Large
95K
Yes
6
No
Medium
60K
No
7
Yes
Large
220K
No
8
No
Small
85K
Yes
9
No
Medium
75K
No
10
No
Small
90K
Yes
Attrib2
Attrib3
Class
Learning
algorithm
Induction
Learn
Model
Model
10
Training Set
Tid
Attrib1
Attrib2
Attrib3
11
No
Small
55K
?
12
Yes
Medium
80K
?
13
Yes
Large
110K
?
14
No
Small
95K
?
15
No
Large
67K
?
Apply
Model
Class
Deduction
10
Test Set
3
Classification Techniques
•
•
•
•
•
•
•
Decision Tree
Naïve Bayes
Logistic Regression
Support Vector Machines
K nearest neighbor
Ensemble learning
……
4
Example of a Decision Tree
Tid Refund Marital
Status
Taxable
Income Cheat
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced 95K
Yes
6
No
Married
No
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
60K
Splitting Attributes
Refund
Yes
No
NO
MarSt
Single, Divorced
TaxInc
< 80K
NO
Married
NO
> 80K
YES
10
Training Data
Model: Decision Tree
5
Another Example of Decision Tree
MarSt
Tid Refund Marital
Status
Taxable
Income Cheat
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced 95K
Yes
6
No
Married
No
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
60K
Married
NO
Single,
Divorced
Refund
No
Yes
NO
TaxInc
< 80K
NO
> 80K
YES
There could be more than one tree that fits
the same data!
10
6
Apply Model to Test Data
Test Data
Start from the root of tree.
Refund
Yes
Refund Marital
Status
Taxable
Income Cheat
No
80K
Married
?
10
No
NO
MarSt
Single, Divorced
TaxInc
< 80K
NO
Married
NO
> 80K
YES
7
Apply Model to Test Data
Test Data
Refund
Yes
Refund Marital
Status
Taxable
Income Cheat
No
80K
Married
?
10
No
NO
MarSt
Single, Divorced
TaxInc
< 80K
NO
Married
NO
> 80K
YES
8
Apply Model to Test Data
Test Data
Refund
Yes
Refund Marital
Status
Taxable
Income Cheat
No
80K
Married
?
10
No
NO
MarSt
Single, Divorced
TaxInc
< 80K
NO
Married
NO
> 80K
YES
9
Apply Model to Test Data
Test Data
Refund
Yes
Refund Marital
Status
Taxable
Income Cheat
No
80K
Married
?
10
No
NO
MarSt
Single, Divorced
TaxInc
< 80K
NO
Married
NO
> 80K
YES
10
Apply Model to Test Data
Test Data
Refund
Yes
Refund Marital
Status
Taxable
Income Cheat
No
80K
Married
?
10
No
NO
MarSt
Single, Divorced
TaxInc
< 80K
NO
Married
NO
> 80K
YES
11
Apply Model to Test Data
Test Data
Refund
Yes
Refund Marital
Status
Taxable
Income Cheat
No
80K
Married
?
10
No
NO
MarSt
Single, Divorced
TaxInc
< 80K
NO
Married
Assign Cheat to “No”
NO
> 80K
YES
12
Build a Decision Tree
• Let Dt be the set of training
records that reach a node t
Tid Refund Marital
Status
Taxable
Income Cheat
1
Yes
Single
125K
No
• General Procedure:
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced 95K
Yes
6
No
Married
No
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
– If Dt contains records that
belong the same class yt,
then t is a leaf node labeled
as yt
– If Dt contains records that
belong to more than one
class, use an attribute to split
the data into smaller
subsets. Recursively apply
the procedure to each
subset
60K
10
Dt
?
13
Example
Refund
Yes
No
Don’t
Cheat
Refund
Refund
Yes
No
Don’t
Marital
Cheat
Status
Single,
Married
Divorced
Don’t
Cheat
Yes
No
Tid Refund Marital
Status
Taxable
Income Cheat
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced 95K
Yes
6
No
Married
No
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
No
Single
90K
Yes
Don’t
Marital
10
Cheat
Status
Single,
Married
Divorced
Don’t
Taxable
Cheat
Income
60K
10
< 80K
Don’t
Cheat
>= 80K
Cheat
14
How to determine the Best Split
Before Splitting: 10 records of class 0,
10 records of class 1
OnCampus?
Yes
No
Car
Type?
Family
Student
ID?
Luxury
c1
c10
Sports
C0: 6
C1: 4
C0: 4
C1: 6
C0: 1
C1: 3
C0: 8
C1: 0
c20
C0: 1
C1: 7
C0: 1
C1: 0
...
C0: 1
C1: 0
c11
C0: 0
C1: 1
...
C0: 0
C1: 1
Which test condition is the best?
15
Question
• The basic idea of classification approach is to
learn from the past to predict the future.
– Are there some cases that are unpredictable or very
difficult to predict?
– If we apply classification algorithms to such cases, what
results we will get?
16
Download