CENG464

Classification
Training and test sets are randomly sampled from the data.
Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

[Figure: Induction - a supervised learning algorithm learns a model from the Training Set; Deduction - the learned model is applied to the Test Set and its accuracy is measured.]
Goal: find a mapping or function that can predict the class label of a given tuple X.
Decision Tree
- more than one DT can be induced from the same training data
(because of different DT algorithms and attribute selection measures)
- the structure of the DT affects the performance of the classification model
o the speed of classification is related to the # of levels in the tree
o the aim is to construct a DT with a minimal # of levels, so that it
classifies as many examples as possible as early (at the highest levels of
the tree) as possible
- the accuracy of classification by the DT is critical
- not every attribute is needed in decision making
- more branches in a DT mean more complexity:
o more memory resources required
o more tests needed
o the DT is more likely to overfit and make errors in prediction
Information Gain
The ID3 algorithm uses the Information Gain measure.
Probability of event Ek: P(Ek) = Nk / M,
where Nk is the # of samples conveying event Ek and M is the whole sample size.
Self-information of Ek: I(Ek) = -log P(Ek)
Entropy of the whole system: H(S) = - Σ_k P(Ek) log P(Ek)
Conditional self-information of event Ek given that Fj occurred:
I(Ek|Fj) = -log P(Ek|Fj) = -log( P(Ek and Fj) / P(Fj) )
Expected information of the system (conditional entropy):
H(S1|S2) = - Σ_j Σ_k P(Ek and Fj) log P(Ek|Fj)
A DT can be considered as two information systems, one based on the attributes and one based on the class.
H(class): average info of the class system before attribute A is considered as the root of the DT.
H(class|A): expected info of the class system after A is chosen as the root.
Information Gain: G(A) = H(class) - H(class|A)
The attribute with the highest information gain is the one that reduces the degree of uncertainty the most.
The ID3 algorithm selects the attribute whose values influence the outcome of the class the most, ensuring that as many examples as possible are classified as close to the root of the tree as possible.
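As a quick illustration (a sketch of my own, not part of the notes; the names entropy, info_gain and records are made up), these quantities can be computed for a list of records as follows:

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = - sum_k P(E_k) * log2 P(E_k), computed from label frequencies."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def info_gain(records, attribute, class_attr="class"):
    """G(A) = H(class) - H(class|A) for records given as dictionaries."""
    base = entropy([r[class_attr] for r in records])
    for value in set(r[attribute] for r in records):
        subset = [r[class_attr] for r in records if r[attribute] == value]
        base -= len(subset) / len(records) * entropy(subset)
    return base
```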
Example: weather data
Find H(class)
Find H(class|Outlook)
Find Information Gain = G (Outlook)
1. Initial data set
no  outlook   temperature  humidity  windy  class
1   sunny     hot          high      FALSE  N
2   sunny     hot          high      TRUE   N
3   overcast  hot          high      FALSE  P
4   rain      mild         high      FALSE  P
5   rain      cool         normal    FALSE  P
6   rain      cool         normal    TRUE   N
7   overcast  cool         normal    TRUE   P
8   sunny     mild         high      FALSE  N
9   sunny     cool         normal    FALSE  P
10  rain      mild         normal    FALSE  P
11  sunny     mild         normal    TRUE   P
12  overcast  mild         high      TRUE   P
13  overcast  hot          normal    FALSE  P
14  rain      mild         high      TRUE   N
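As a check of the example above (again an illustrative sketch, not taken from the notes), the snippet below encodes the table and computes H(class) and the gain of each attribute; Outlook should come out highest, with a gain of about 0.247 (rounded to 0.246 in the Gain Ratio example later).

```python
import math
from collections import Counter

# Weather data from the table above: (outlook, temperature, humidity, windy, class)
DATA = [
    ("sunny", "hot", "high", "FALSE", "N"), ("sunny", "hot", "high", "TRUE", "N"),
    ("overcast", "hot", "high", "FALSE", "P"), ("rain", "mild", "high", "FALSE", "P"),
    ("rain", "cool", "normal", "FALSE", "P"), ("rain", "cool", "normal", "TRUE", "N"),
    ("overcast", "cool", "normal", "TRUE", "P"), ("sunny", "mild", "high", "FALSE", "N"),
    ("sunny", "cool", "normal", "FALSE", "P"), ("rain", "mild", "normal", "FALSE", "P"),
    ("sunny", "mild", "normal", "TRUE", "P"), ("overcast", "mild", "high", "TRUE", "P"),
    ("overcast", "hot", "normal", "FALSE", "P"), ("rain", "mild", "high", "TRUE", "N"),
]

def entropy(labels):
    counts = Counter(labels)
    return -sum(n / len(labels) * math.log2(n / len(labels)) for n in counts.values())

def gain(rows, col):
    base = entropy([r[-1] for r in rows])
    for value in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == value]
        base -= len(subset) / len(rows) * entropy(subset)
    return base

print(f"H(class) = {entropy([r[-1] for r in DATA]):.3f}")          # ~0.940
for i, name in enumerate(["outlook", "temperature", "humidity", "windy"]):
    print(f"G({name}) = {gain(DATA, i):.3f}")                       # outlook is the largest
```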
2. The Information Gain measure shows that Outlook is the highest-gain attribute.
Partition the training set into three subsets, with Outlook as the root:
[Partial tree: Outlook as the root, with branches Sunny, Overcast and Rain]

Outlook=sunny
outlook  temperature  humidity  windy  class
sunny    hot          high      FALSE  N
sunny    hot          high      TRUE   N
sunny    mild         high      FALSE  N
sunny    cool         normal    FALSE  P
sunny    mild         normal    TRUE   P

Outlook=overcast
outlook   temperature  humidity  windy  class
overcast  hot          high      FALSE  P
overcast  cool         normal    TRUE   P
overcast  mild         high      TRUE   P
overcast  hot          normal    FALSE  P

Outlook=rain
outlook  temperature  humidity  windy  class
rain     mild         high      FALSE  P
rain     cool         normal    FALSE  P
rain     cool         normal    TRUE   N
rain     mild         normal    FALSE  P
rain     mild         high      TRUE   N
3. Analyze each subset
Outlook=sunny
outlook  temperature  humidity  windy  class
sunny    hot          high      FALSE  N
sunny    hot          high      TRUE   N
sunny    mild         high      FALSE  N
sunny    cool         normal    FALSE  P
sunny    mild         normal    TRUE   P
Apply the Information Gain measure: Humidity has the highest gain value.
- Select Humidity as the root of this subtree and create two subsets:
Humidity=high
outlook  temperature  humidity  windy  class
sunny    hot          high      FALSE  N
sunny    hot          high      TRUE   N
sunny    mild         high      FALSE  N

Humidity=normal
outlook  temperature  humidity  windy  class
sunny    cool         normal    FALSE  P
sunny    mild         normal    TRUE   P
Each subset contains samples with only one class label (N or P).
- Two leaf nodes labeled N and P are created.
[Partial tree: Outlook root; Sunny -> Humidity (high -> N, normal -> P); Overcast and Rain branches not yet expanded]
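As an illustrative check of this step (computed by me from the five sunny rows above, not quoted from the notes), the gains within the Outlook=sunny subset can be recomputed with the same entropy and gain helpers, redefined here so the snippet runs on its own; Humidity comes out highest:

```python
import math
from collections import Counter

# Outlook = sunny rows: (temperature, humidity, windy, class)
SUNNY = [
    ("hot", "high", "FALSE", "N"), ("hot", "high", "TRUE", "N"),
    ("mild", "high", "FALSE", "N"), ("cool", "normal", "FALSE", "P"),
    ("mild", "normal", "TRUE", "P"),
]

def entropy(labels):
    counts = Counter(labels)
    return -sum(n / len(labels) * math.log2(n / len(labels)) for n in counts.values())

def gain(rows, col):
    base = entropy([r[-1] for r in rows])
    for value in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == value]
        base -= len(subset) / len(rows) * entropy(subset)
    return base

for i, name in enumerate(["temperature", "humidity", "windy"]):
    print(f"G({name} | sunny) = {gain(SUNNY, i):.3f}")
# humidity has the largest gain (~0.971): high -> all N, normal -> all P
```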
4. Analyze subset Outlook=overcast

Outlook=overcast
outlook   temperature  humidity  windy  class
overcast  hot          high      FALSE  P
overcast  cool         normal    TRUE   P
overcast  mild         high      TRUE   P
overcast  hot          normal    FALSE  P
All the samples have the same class label P, so create a leaf node labeled P.
[Partial tree: Outlook root; Sunny -> Humidity (high -> N, normal -> P); Overcast -> P; Rain branch not yet expanded]
5. Analyze subset Outlook=rain

Outlook=rain
outlook  temperature  humidity  windy  class
rain     mild         high      FALSE  P
rain     cool         normal    FALSE  P
rain     cool         normal    TRUE   N
rain     mild         normal    FALSE  P
rain     mild         high      TRUE   N
Calculate the information gain for each attribute: the Windy attribute has the highest gain, so partition over Windy.
Windy=False
outlook  temperature  humidity  windy  class
rain     mild         high      FALSE  P
rain     cool         normal    FALSE  P
rain     mild         normal    FALSE  P

Windy=True
outlook  temperature  humidity  windy  class
rain     cool         normal    TRUE   N
rain     mild         high      TRUE   N
In each subset, all samples have the same class label, so create leaf nodes with those class labels.
[Final tree: Outlook root; Sunny -> Humidity (high -> N, normal -> P); Overcast -> P; Rain -> Windy (False -> P, True -> N)]
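Putting the steps together, here is a minimal recursive ID3-style sketch of my own (categorical attributes only, no pruning; variable and function names are illustrative). Run on the weather table, it reproduces the tree above: Outlook at the root, Humidity under sunny, a P leaf under overcast and Windy under rain.

```python
import math
from collections import Counter

ATTRS = ["outlook", "temperature", "humidity", "windy"]
DATA = [  # (outlook, temperature, humidity, windy, class), as in the initial table
    ("sunny", "hot", "high", "FALSE", "N"), ("sunny", "hot", "high", "TRUE", "N"),
    ("overcast", "hot", "high", "FALSE", "P"), ("rain", "mild", "high", "FALSE", "P"),
    ("rain", "cool", "normal", "FALSE", "P"), ("rain", "cool", "normal", "TRUE", "N"),
    ("overcast", "cool", "normal", "TRUE", "P"), ("sunny", "mild", "high", "FALSE", "N"),
    ("sunny", "cool", "normal", "FALSE", "P"), ("rain", "mild", "normal", "FALSE", "P"),
    ("sunny", "mild", "normal", "TRUE", "P"), ("overcast", "mild", "high", "TRUE", "P"),
    ("overcast", "hot", "normal", "FALSE", "P"), ("rain", "mild", "high", "TRUE", "N"),
]

def entropy(labels):
    counts = Counter(labels)
    return -sum(n / len(labels) * math.log2(n / len(labels)) for n in counts.values())

def gain(rows, col):
    base = entropy([r[-1] for r in rows])
    for value in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == value]
        base -= len(subset) / len(rows) * entropy(subset)
    return base

def id3(rows, columns):
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:                    # pure subset -> leaf node
        return labels[0]
    if not columns:                              # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(columns, key=lambda c: gain(rows, c))
    node = {ATTRS[best]: {}}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]
        node[ATTRS[best]][value] = id3(subset, [c for c in columns if c != best])
    return node

tree = id3(DATA, list(range(4)))
print(tree)
# {'outlook': {'sunny': {'humidity': {'high': 'N', 'normal': 'P'}},
#              'overcast': 'P',
#              'rain': {'windy': {'FALSE': 'P', 'TRUE': 'N'}}}}  (branch order may vary)

def classify(tree, record):
    """Follow the attribute tests down to a leaf label."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[record[attr]]
    return tree

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # -> 'P'
```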
Gain Ratio
The C4.5 and C5.0 algorithms use the Gain Ratio. It normalizes the information gain across different attributes in order to avoid bias towards attributes with more distinct values.
Gain Ratio (A) = Gain(A) / H(A)
H(Outlook) = -5/14 log 5/14 - 4/14 log 4/14 - 5/14 log 5/14 = 1.577
Gain Ratio (Outlook) = 0.246 / 1.577 = 0.156
The Information Gain Ratio performs better than the information gain in terms of the classification accuracy of the resulting tree.
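These numbers can be reproduced with a few lines (an illustrative sketch; the split information H(A) is just the entropy of the attribute's value distribution, here 5 sunny / 4 overcast / 5 rain):

```python
import math

def split_info(counts):
    """H(A): entropy of an attribute's value distribution."""
    total = sum(counts)
    return -sum(n / total * math.log2(n / total) for n in counts)

h_outlook = split_info([5, 4, 5])                          # 5 sunny, 4 overcast, 5 rain
print(f"H(Outlook) = {h_outlook:.3f}")                     # ~1.577
print(f"Gain Ratio(Outlook) = {0.246 / h_outlook:.3f}")    # ~0.156
```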
Gini Index
The CART, SPRINT and SLIQ algorithms use the Gini Index.
Assume the class has w labels c1, c2, c3, ..., cw.
The Gini impurity function measures the level of impurity of the data set:
Gini(t) = 1 - Σ_i P(Ci|t)^2   OR   Gini(class) = 1 - Σ_i P(Ci)^2
where P(Ci|t) is the fraction of the specific class under condition t.
When a DT is constructed, the algorithm searches for an attribute whose values split an impure set into purer partitions.
Choose the attribute that reduces the impurity the most:
Gini Index (A) = Gini(class) - Σ_j P(aj) Gini(A=aj), where P(aj) is the probability of A=aj.
Example: Weather data
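The notes do not work this example out, so the following is an illustrative sketch of my own: it computes Gini(class) for the weather data (9 P, 5 N) and the Gini Index reduction obtained by splitting on Outlook (sunny 2P/3N, overcast 4P/0N, rain 3P/2N).

```python
def gini(counts):
    """Gini impurity 1 - sum_i P(C_i)^2, computed from class counts."""
    total = sum(counts)
    return 1 - sum((n / total) ** 2 for n in counts)

gini_class = gini([9, 5])                                    # ~0.459

# Outlook partitions: (subset size, [P count, N count])
partitions = [(5, [2, 3]), (4, [4, 0]), (5, [3, 2])]
weighted = sum(size / 14 * gini(counts) for size, counts in partitions)

print(f"Gini(class) = {gini_class:.3f}")
print(f"Gini Index(Outlook) = {gini_class - weighted:.3f}")  # ~0.116
```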
Chi-Square (χ²) Statistics
Used in the CHAID algorithm. It is a measure of the degree of association or dependence between two variables. For classification, it can be used to measure the degree of dependence between a given attribute and the class variable.
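As an illustration (not part of the notes), the χ² statistic for Outlook versus class on the weather data can be computed by hand from the contingency table of observed counts:

```python
# Contingency table: rows = sunny, overcast, rain; columns = P, N
observed = [[2, 3], [4, 0], [3, 2]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total   # count expected under independence
        chi2 += (obs - expected) ** 2 / expected

print(f"chi-square = {chi2:.3f}")   # larger values mean stronger attribute/class dependence
```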
Problem of overfitting
- the classification model built from a limited set of training examples is too representative of the training examples themselves rather than of the general characteristics of the data
o a more complex tree is built, with too many branches; some may reflect anomalies due to noise or outliers
o this leads to less accurate predictions
- causes of overfitting:
o presence of noise records in the training set,
o lack of representative examples in the training set,
o the repeated attribute selection process of the model construction algorithm
Pruning: approaches to avoid overfitting by replacing a subtree with a leaf node, making the tree smaller, more robust and less prone to error.
o Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold.
- It is difficult to choose an appropriate threshold.
o Postpruning: remove branches from a "fully grown" tree.
- Use a set of data different from the training data to decide which is the "best pruned tree" (a sketch follows below).
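A minimal reduced-error post-pruning sketch (one possible postpruning strategy, not necessarily the one a particular algorithm uses). It assumes the nested-dict tree format of the ID3 sketch above and validation records given as dictionaries with attribute values and a "class" key: a subtree is replaced by a majority-class leaf whenever that does not reduce accuracy on the validation records that reach it.

```python
def predict(tree, record):
    """Follow the tree's attribute tests down to a leaf label."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches.get(record[attr])       # None if the value was unseen in training
    return tree

def majority(records):
    labels = [r["class"] for r in records]
    return max(set(labels), key=labels.count)

def prune(tree, validation):
    """Prune the children first, then try to collapse this subtree into a leaf."""
    if not isinstance(tree, dict) or not validation:
        return tree
    attr, branches = next(iter(tree.items()))
    pruned = {attr: {value: prune(sub, [r for r in validation if r[attr] == value])
                     for value, sub in branches.items()}}
    leaf = majority(validation)
    subtree_hits = sum(predict(pruned, r) == r["class"] for r in validation)
    leaf_hits = sum(leaf == r["class"] for r in validation)
    return leaf if leaf_hits >= subtree_hits else pruned
```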
Other issues related to DT
- CART method: produces a 2-way split (binary tree).
Internal nodes: a Boolean test on an attribute value -> T/F.
Simpler in representation, but with more levels than multi-way splits.
- Tree shape is important: a DT that classifies most data records with a few tests (few levels) and uses more tests for a few records with a particular set of attribute values is more desirable.
- How to perform attribute selection for continuous attributes? Use a threshold value to split the values into two partitions, use a relational condition for the test, and calculate the attribute selection measure for the resulting split (see the sketch after this list).
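An illustrative sketch of threshold selection for a continuous attribute: candidate thresholds are midpoints between adjacent sorted values, and the one with the highest information gain on the resulting binary split is chosen. The example call uses the Attrib3 income values and class labels from the first training table; the threshold it reports is my own computation, not a value from the notes.

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    return -sum(n / len(labels) * math.log2(n / len(labels)) for n in counts.values())

def best_threshold(values, labels):
    """Try midpoints between adjacent distinct sorted values; return (threshold, gain)."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = (None, -1.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        left = [label for v, label in pairs if v <= t]
        right = [label for v, label in pairs if v > t]
        g = (base - len(left) / len(pairs) * entropy(left)
                  - len(right) / len(pairs) * entropy(right))
        if g > best[1]:
            best = (t, g)
    return best

# Attrib3 (income) values and Class labels from the first training table
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
classes = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_threshold(incomes, classes))   # best binary split around 97.5, gain ~0.28
```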
Advantages of DT
- Explains why the decision is made, through a classification rule
- Classifies unseen records efficiently; classification time depends on the # of levels of the tree
- Can handle categorical and continuous attributes
Weaknesses of DT
- Have high error rates
- Computationally expensive to build, as recursive functions are used