
Huffman Trees and ID3
Prof. Sin-Min Lee
Department of Computer Science
Huffman coding is an algorithm for lossless data compression, developed by David A. Huffman as a PhD student at MIT in 1952 and published in "A Method for the Construction of Minimum-Redundancy Codes."

Huffman codes are widely used in applications that involve the compression and transmission of digital data, such as fax machines, modems, computer networks, and high-definition television (HDTV).
Professor David A. Huffman
(August 9, 1925 - October 7, 1999)
Motivation
The motivations for data compression are obvious:
• reducing the space required to store files on disk or tape
• reducing the time to transmit large files.
Image Source : plus.maths.org/issue23/ features/data/data.jpg
Huffman coding typically yields savings of between 20% and 90%.
Basic Idea :
It uses a variable-length code table for encoding a
source symbol (such as a character in a file) where
the variable-length code table has been derived in
a particular way based on the frequency of
occurrence for each possible value of the source
symbol.
Example:
Suppose you have a file with 100K characters.
For simplicity assume that there are only 6 distinct
characters in the file from a through f, with frequencies as
indicated below.
We represent the file using a unique binary string for each
character.
Character               a     b     c     d     e     f
Frequency (in 1000s)    45    13    12    16    9     5
Fixed-length codeword   000   001   010   011   100   101
Space =
(45*3 + 13*3 + 12*3 + 16*3 + 9*3 + 5*3) * 1000
= 300K bits
Can we do better ??
YES !!
By using variable-length codes instead of fixed-length codes.
Idea : Giving frequent characters short codewords, and infrequent
characters long codewords.
i.e. the more frequent the character, the shorter its codeword.
Character                  a     b     c     d     e     f
Frequency (in 1000s)       45    13    12    16    9     5
Fixed-length codeword      000   001   010   011   100   101
Variable-length codeword   0     101   100   111   1101  1100
Space =
(45*1 + 13*3 + 12*3 + 16*3 + 9*4 + 5*4) * 1000
= 224K bits
( Savings = 25%)
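The space figures above can be checked with a few lines of code. The following is a minimal Python sketch (not part of the original slides); the frequencies and codeword lengths are taken from the table above.

# Reproduce the 300K-bit vs 224K-bit comparison for the 100K-character file.
freq = {'a': 45_000, 'b': 13_000, 'c': 12_000, 'd': 16_000, 'e': 9_000, 'f': 5_000}
fixed_len    = {c: 3 for c in freq}                              # every fixed codeword is 3 bits
variable_len = {'a': 1, 'b': 3, 'c': 3, 'd': 3, 'e': 4, 'f': 4}  # codeword lengths from the table

def total_bits(freq, code_len):
    """Total file size in bits: sum over characters of frequency * codeword length."""
    return sum(freq[c] * code_len[c] for c in freq)

print(total_bits(freq, fixed_len))     # 300000 bits = 300K bits
print(total_bits(freq, variable_len))  # 224000 bits = 224K bits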
PREFIX CODES :
Codes in which no codeword is also a prefix of some
other codeword.
("prefix-free codes" would have been a more appropriate name)
Character                  a     b     c     d     e     f
Variable-length codeword   0     101   100   111   1101  1100
It is very easy to encode and decode using prefix codes.
No Ambiguity !!
It is possible to show (although we won't do so here)
that the optimal data compression achievable by a
character code can always be achieved with a prefix
code, so there is no loss of generality in restricting
attention to prefix codes.
Benefits of using Prefix Codes:
Example:

Character                  a     b     c     d     e     f
Variable-length codeword   0     101   100   111   1101  1100

FACE is encoded as 1100 0 100 1101 = 110001001101
To decode, we have to decide where each codeword begins and ends, since they are no longer all the same length. But this is easy, since no codewords share a prefix. This means we need only scan the input string from left to right, and as soon as we recognize a codeword, we can print the corresponding character and start looking for the next codeword.
In the above case, the only codeword that is a prefix of "110001001101" is "1100" = "f", so we can print "f" and start decoding "0100...", get "a", etc.
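As a small illustration (a Python sketch, not part of the original slides), left-to-right decoding of a prefix code can be done by accumulating bits until they match a codeword; because no codeword is a prefix of another, the first match is always the right one.

code = {'a': '0', 'b': '101', 'c': '100', 'd': '111', 'e': '1101', 'f': '1100'}

def decode(bits, code):
    decode_map = {word: ch for ch, word in code.items()}
    out, buf = [], ''
    for bit in bits:
        buf += bit
        if buf in decode_map:          # a complete codeword has been recognized
            out.append(decode_map[buf])
            buf = ''                   # start looking for the next codeword
    return ''.join(out)

print(decode('110001001101', code))    # -> 'face'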
Benefits of using Prefix Codes:
Example:
To see why the no-common-prefix property is essential, suppose that we encoded "e" with the shorter code "110".
Character                               a     b     c     d     e     f
Variable-length codeword (original)     0     101   100   111   1101  1100
Variable-length codeword (e shortened)  0     101   100   111   110   1100

FACE = 1100 0 100 110 = 11000100110
When we try to decode the first four bits "1100", we cannot tell whether
1100 = "f"
or
1100 = 110 + 0 = "ea".
Representation:
A Huffman code is represented as:
• a binary tree
• each edge represents either 0 or 1
• 0 means "go to the left child"
• 1 means "go to the right child"
• each leaf corresponds to the sequence of 0s and 1s traversed from the root to reach it, i.e. a particular codeword.
Since no prefix is shared, all legal codes are at the leaves,
and decoding a string means following edges, according to
the sequence of 0s and 1s in the string, until a leaf is
reached.
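Decoding with the tree view can be sketched the same way (the Node class and its field names below are mine, not from the slides): walk from the root, go left on 0 and right on 1, and emit a character whenever a leaf is reached.

class Node:
    def __init__(self, char=None, left=None, right=None):
        self.char, self.left, self.right = char, left, right   # leaf iff char is not None

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node.left if bit == '0' else node.right
        if node.char is not None:      # reached a leaf: emit its character
            out.append(node.char)
            node = root                # restart at the root for the next codeword
    return ''.join(out)

# The tree for the variable-length code of the running example:
leaf = lambda ch: Node(char=ch)
root = Node(left=leaf('a'),
            right=Node(left=Node(left=leaf('c'), right=leaf('b')),
                       right=Node(left=Node(left=leaf('f'), right=leaf('e')),
                                  right=leaf('d'))))
print(decode('110001001101', root))    # -> 'face'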
Character               a     b     c     d     e     f
Frequency (in 1000s)    45    13    12    16    9     5
Fixed-length codeword   000   001   010   011   100   101

[Tree for the fixed-length code: root 100 with children 86 and 14; 86 has children 58 (leaves a:45, b:13) and 28 (leaves c:12, d:16); 14 has leaves e:9 and f:5. Every left edge is labeled 0 and every right edge 1.]

Character                  a     b     c     d     e     f
Frequency (in 1000s)       45    13    12    16    9     5
Variable-length codeword   0     101   100   111   1101  1100

[Tree for the variable-length code: root 100 with children a:45 and 55; 55 has children 25 (leaves c:12, b:13) and 30; 30 has children 14 (leaves f:5, e:9) and d:16. Every left edge is labeled 0 and every right edge 1.]

Labeling:
leaf -> the character it represents and the frequency with which it appears in the text.
internal node -> the frequency with which all leaf nodes under it appear in the text (i.e. the sum of their frequencies).
Optimal Code

[Figure: the fixed-length code tree and the variable-length (optimal) code tree from the previous slide, shown side by side.]
An optimal code for a file is always represented by a full binary tree, in
which every non-leaf node has two children.
The fixed-length code in our example is not optimal since its tree is not a full binary tree: there are codewords beginning 10..., but none beginning 11...
Since we can now restrict our attention to full binary trees, we can say that if
C is the alphabet from which the characters are drawn, then the tree for an
optimal prefix code has exactly |C| leaves, one for each letter of the
alphabet, and exactly |C| - 1 internal nodes.
Given a tree T corresponding to a prefix code, it is a simple
matter to compute the number of bits required to encode a
file.
For each character c in the alphabet C, let
• f(c) denote the frequency of c in the file
• d_T(c) denote the depth of c's leaf in the tree
(d_T(c) is also the length of the codeword for character c).
The number of bits required to encode a file is thus
B(T) = Σ_{c ∈ C} f(c) · d_T(c)
which we define as the cost of the tree.
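As a quick sketch (an assumed helper, not from the slides), the cost B(T) is straightforward to compute from the frequencies and the leaf depths (i.e. the codeword lengths):

def tree_cost(freq, depth):
    """B(T) = sum over characters c of f(c) * d_T(c)."""
    return sum(freq[c] * depth[c] for c in freq)

# For the running example (frequencies in thousands, depths = codeword lengths):
freq  = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}
depth = {'a': 1, 'b': 3, 'c': 3, 'd': 3, 'e': 4, 'f': 4}
print(tree_cost(freq, depth))   # 224 (thousand bits), matching the earlier calculation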
Constructing a Huffman code
Huffman invented a greedy algorithm that constructs an
optimal prefix code called a Huffman code. The algorithm
builds the tree T corresponding to the optimal code in a
bottom-up manner.
It begins with a set of |C| leaves and performs a sequence of
|C| - 1 "merging" operations to create the final tree.
Greedy Choice?
The two smallest nodes are chosen at each step, and this local decision
results in a globally optimal encoding tree.
In general, greedy algorithms make small-grained, locally minimal/maximal choices in the hope of arriving at a global minimum/maximum.
HUFFMAN(C)
1  n ← |C|
2  Q ← C
3  for i ← 1 to n - 1
4      do ALLOCATE-NODE(z)
5         left[z] ← x ← EXTRACT-MIN(Q)
6         right[z] ← y ← EXTRACT-MIN(Q)
7         f[z] ← f[x] + f[y]
8         INSERT(Q, z)
9  return EXTRACT-MIN(Q)
C is a set of n characters, and each character c ∈ C is an object with a defined frequency f[c].
A min-priority queue Q, keyed on f, is used to identify the two least-frequent objects to merge together. The result of merging two objects is a new object whose frequency is the sum of the frequencies of the two objects that were merged.
For our example, Huffman's algorithm proceeds as shown.
1  n ← |C|
2  Q ← C
Line 1 sets the initial queue size, n = 6 (letters in the alphabet).
Line 2 initializes the priority queue Q with the characters in C (a through f).
3  for i ← 1 to n - 1
4      do ALLOCATE-NODE(z)
5         left[z] ← x ← EXTRACT-MIN(Q)
6         right[z] ← y ← EXTRACT-MIN(Q)
7         f[z] ← f[x] + f[y]
8         INSERT(Q, z)
The for loop uses n - 1 (6 - 1 = 5) merge steps to build the tree.
It repeatedly extracts the two nodes x and y of lowest frequency from the queue, and
replaces them in the queue with a new node z representing their merger.
The frequency of z is computed as the sum of the frequencies of x and y in line 7. The
node z has x as its left child and y as its right child.
9 return EXTRACT-MIN(Q)
After mergers, the one node left in the queue -- the root -- is returned in line 9.
The final tree represents the optimal prefix code. The codeword for a letter is the
sequence of edge labels on the path from the root to the letter.
The steps of Huffman's algorithm
Initial queue, ordered by frequency:  f:5  e:9  c:12  b:13  d:16  a:45

Step 1: merge f:5 and e:9 into a node of weight 14.   Queue: c:12  b:13  14  d:16  a:45
Step 2: merge c:12 and b:13 into a node of weight 25. Queue: 14  d:16  25  a:45
Step 3: merge 14 and d:16 into a node of weight 30.   Queue: 25  30  a:45
Step 4: merge 25 and 30 into a node of weight 55.     Queue: a:45  55
Step 5: merge a:45 and 55 into the root of weight 100.

In each merge, the first node extracted becomes the left child (edge label 0) and the second the right child (edge label 1); the resulting tree is the optimal code tree shown earlier.
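The pseudocode and the merge steps above translate directly into a short program. The following is a runnable sketch (my own naming, using Python's heapq as the binary min-heap priority queue); with this tie-breaking it reproduces the codewords of the running example, though any tree of the same cost is equally optimal.

import heapq
from itertools import count

def huffman(freq):
    """freq: dict mapping character -> frequency. Returns dict character -> codeword."""
    order = count()                                # tie-breaker so the heap never compares trees
    # Each queue entry is (frequency, tie-breaker, tree); a tree is either a single
    # character (a leaf) or a (left, right) pair (an internal node).
    heap = [(f, next(order), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    for _ in range(len(freq) - 1):                 # |C| - 1 merging operations
        fx, _, x = heapq.heappop(heap)             # x <- EXTRACT-MIN(Q)
        fy, _, y = heapq.heappop(heap)             # y <- EXTRACT-MIN(Q)
        heapq.heappush(heap, (fx + fy, next(order), (x, y)))   # f[z] = f[x] + f[y]
    _, _, root = heap[0]                           # the single remaining node is the root

    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):                # internal node: 0 = left edge, 1 = right edge
            walk(tree[0], prefix + '0')
            walk(tree[1], prefix + '1')
        else:                                      # leaf: record its codeword
            codes[tree] = prefix
    walk(root, '')
    return codes

print(huffman({'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}))
# -> {'a': '0', 'c': '100', 'b': '101', 'f': '1100', 'e': '1101', 'd': '111'}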
Running Time Analysis
The analysis of the running time of Huffman's algorithm
assumes that Q is implemented as a binary min-heap.
• For a set C of n characters, the initialization of Q in
line 2 can be performed in O(n) time using the
BUILD-MIN-HEAP procedure.
• The for loop in lines 3-8 is executed exactly n - 1 times, and each heap operation requires time O(log n). The loop therefore contributes (n - 1) · O(log n) = O(n log n).
Thus, the total running time of HUFFMAN on a set of n characters is O(n) + O(n log n) = O(n log n).
Correctness of Huffman's algorithm
To prove that the greedy algorithm HUFFMAN is correct, we
show that the problem of determining an optimal prefix
code exhibits the greedy-choice and optimal-substructure
properties.
Lemma that shows that the greedy-choice property holds.
Lemma :
Let C be an alphabet in which each character c ∈ C has frequency f[c].
Let x and y be two characters in C having the lowest frequencies. Then
there exists an optimal prefix code for C in which the codewords for x
and y have the same length and differ only in the last bit.
Why ?
Must be on the bottom (least frequent)
Full tree, so they must be siblings, and so differ in one bit.
Proof:
The idea of the proof is to take the tree T representing an arbitrary
optimal prefix code and modify it to make a tree representing another
optimal prefix code such that the characters x and y appear as sibling
leaves of maximum depth in the new tree. If we can do this, then their
codewords will have the same length and differ only in the last bit.
Proof
[Figure: tree T with sibling leaves a and b at maximum depth and leaves x and y elsewhere; tree T’ is obtained from T by exchanging the positions of x and a.]
Let a and b be two characters that are sibling leaves of maximum depth in T. Without loss of generality, assume that f[a] ≤ f[b] and f[x] ≤ f[y].
Since f[x] and f[y] are the two lowest frequencies, in that order, and f[a] and f[b] are two arbitrary frequencies, in that order, we have f[x] ≤ f[a] and f[y] ≤ f[b].
Exchange the positions of a and x in T, to produce T’.
The difference in cost between T and T’ is
B(T) – B(T’) = Σ_c f(c)·d_T(c) − Σ_c f(c)·d_T’(c)
             = f[x]·d_T(x) + f[a]·d_T(a) − f[x]·d_T’(x) − f[a]·d_T’(a)
             = f[x]·d_T(x) + f[a]·d_T(a) − f[x]·d_T(a) − f[a]·d_T(x)
             = (f[a] − f[x]) (d_T(a) − d_T(x))
             ≥ 0   (the cost is not increased)
Proof
[Figure: trees T, T’ and T’’; T’ is T with x and a exchanged, and T’’ is T’ with y and b exchanged, so that x and y end up as sibling leaves of maximum depth.]
Similarly, exchanging the positions of b and y in T’ to produce T’’ does not increase the cost, i.e. B(T’) – B(T’’) is non-negative.
Therefore B(T’’) ≤ B(T), and since T is optimal, B(T) ≤ B(T’’), so B(T’’) = B(T).
Thus, T’’ is an optimal tree in which x and y appear as sibling leaves of maximum depth, from which the lemma follows.
Lemma that shows that the optimal substructure property
holds.
Lemma :
Let C be a given alphabet with frequency f[c] defined for each character c ∈ C. Let x and y be two characters in C with minimum frequency. Let C’ be the alphabet C with the characters x, y removed and a (new) character z added, so that C’ = C – {x, y} ∪ {z}; define f for C’ as for C, except that f[z] = f[x] + f[y]. Let T’ be any tree
representing an optimal prefix code for the alphabet C’. Then the tree
T, obtained from T’ by replacing the leaf node for z with an internal
node having x and y as children, represents an optimal prefix code for
the alphabet C.
Proof:
We first express B(T) in terms of B(T’).
For every c ∈ C – {x, y} we have d_T(c) = d_T’(c), and hence f[c]·d_T(c) = f[c]·d_T’(c).
Since d_T(x) = d_T(y) = d_T’(z) + 1, we have
f[x]·d_T(x) + f[y]·d_T(y) = (f[x] + f[y]) (d_T’(z) + 1)
                          = f[z]·d_T’(z) + (f[x] + f[y])
from which we conclude that
B(T) = B(T’) + (f[x] + f[y]), or equivalently
B(T’) = B(T) − (f[x] + f[y]).
Proof by contradiction
Suppose that T does not represent an optimal prefix code for C.
Then there exists a tree T’’ such that B(T’’) < B(T).
Without loss of generality (by the previous lemma), T’’ has x & y as siblings.
Let T’’’ be the tree T’’ with the common parent of x & y replaced by a leaf z
with frequency f[z] = f[x] + f[y].
Then, B(T’’’) = B(T’’) − (f[x] + f[y])
             < B(T) − (f[x] + f[y])
             = B(T’)
Yielding a contradiction to the assumption that T’ represents an optimal
prefix code for C’. Thus, T must represent an optimal prefix code for the
alphabet C.
Drawbacks
The main disadvantage of Huffman’s method is that it makes two
passes over the data:
• one pass to collect frequency counts of the letters in the
message, followed by the construction of a Huffman tree and
transmission of the tree to the receiver; and
• a second pass to encode and transmit the letters
themselves, based on the static tree structure.
This causes delay when used for network communication, and in
file compression applications the extra disk accesses can slow
down the algorithm.
We need one-pass methods, in which letters are encoded “on the fly”.
ID3 algorithm
• To get the fastest decision-making procedure, one has to arrange the attributes in a decision tree in the proper order - the most discriminating attributes first. This is done by the algorithm called ID3.
• The most discriminating attribute can be defined in precise terms as the attribute for which fixing its value changes the entropy of the possible decisions the most. Let w_j be the frequency of the j-th decision in a set of examples x. Then the entropy of the set is
E(x) = − Σ_j w_j · log(w_j)
• Let fix(x, a, v) denote the set of those elements of x whose value of attribute a is v. The average entropy that remains in x after the value of a has been fixed is
H(x, a) = Σ_v k_v · E(fix(x, a, v)),
where k_v is the ratio of examples in x with attribute a having value v (a small sketch of these two quantities follows below).
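A minimal Python sketch of E(x) and H(x, a) (the names and the example format - a list of (attribute-dict, decision) pairs - are my own; base-2 logarithms are used, as in the worked examples later):

from math import log2
from collections import Counter

def entropy(examples):
    """E(x) = -sum_j w_j * log2(w_j), where w_j is the relative frequency of decision j."""
    counts = Counter(decision for _, decision in examples)
    total = len(examples)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def remaining_entropy(examples, attr):
    """H(x, a) = sum over values v of k_v * E(fix(x, a, v))."""
    total = len(examples)
    values = {attrs[attr] for attrs, _ in examples}
    h = 0.0
    for v in values:
        subset = [(a, d) for a, d in examples if a[attr] == v]   # fix(x, a, v)
        h += (len(subset) / total) * entropy(subset)             # k_v * E(...)
    return h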
Now we want a quantitative way of seeing the effect of splitting the dataset on a particular attribute (which is part of the tree-building process). We can use a measure called Information Gain, which calculates the reduction in entropy (gain in information) that would result from splitting the data on an attribute A:
Gain(S, A) = Entropy(S) − Σ_v (|Sv| / |S|) · Entropy(Sv)
where v ranges over the values of A, Sv is the subset of instances of S where A takes the value v, |Sv| is the number of instances in Sv, and |S| is the total number of instances.
Continuing with our example dataset - let's name it S just for convenience - let's work out the Information Gain that splitting on the attribute District would yield over the entire dataset.
By calculating this value for each attribute that remains, we can see which attribute splits the data most purely. Let's imagine we want to select an attribute for the root node; performing the above calculation for all attributes gives:
•Gain(S,House Type) = 0.049 bits
•Gain(S,Income) =0.151 bits
•Gain(S,Previous Customer) = 0.048 bits
We can clearly see that District results in the highest
reduction in entropy or the highest information gain.
We would therefore choose it for the root node, splitting the data up into subsets corresponding to all the different values of the District attribute.
With this node-evaluation technique we can proceed recursively through the subsets we create until leaf nodes have been reached throughout and all subsets are pure, with zero entropy. This is exactly how ID3 and other variants work.
Example 1
If S is a collection of 14 examples with 9 YES and 5 NO examples
then
Entropy(S) = - (9/14) Log2 (9/14) - (5/14) Log2 (5/14) = 0.940
Notice entropy is 0 if all members of S belong to the same class (the
data is perfectly classified). The range of entropy is 0 ("perfectly
classified") to 1 ("totally random").
Gain(S, A), the information gain of example set S on attribute A, is defined as
Gain(S, A) = Entropy(S) − Σ ((|Sv| / |S|) * Entropy(Sv))
Where: the sum runs over each value v of all possible values of attribute A
Sv = subset of S for which attribute A has value v
|Sv| = number of elements in Sv
|S| = number of elements in S
Example 2. Suppose S is a set of 14 examples in which one of the
attributes is wind speed. The values of Wind can be Weak or Strong.
The classification of these 14 examples are 9 YES and 5 NO. For
attribute Wind, suppose there are 8 occurrences of Wind = Weak and 6
occurrences of Wind = Strong. For Wind = Weak, 6 of the examples are
YES and 2 are NO. For Wind = Strong, 3 are YES and 3 are NO.
Entropy(Sweak) = − (6/8)*log2(6/8) − (2/8)*log2(2/8) = 0.811
Entropy(Sstrong) = − (3/6)*log2(3/6) − (3/6)*log2(3/6) = 1.00
Therefore
Gain(S, Wind) = Entropy(S) − (8/14)*Entropy(Sweak) − (6/14)*Entropy(Sstrong)
             = 0.940 − (8/14)*0.811 − (6/14)*1.00
             = 0.048
For each attribute, the gain is calculated and the highest gain is used in
the decision node.
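Reusing the entropy and remaining_entropy sketches from above, Example 2 can be reproduced directly (the list below simply encodes the stated counts; it is not a real weather dataset):

def gain(examples, attr):
    """Gain(S, A) = Entropy(S) - sum_v (|Sv| / |S|) * Entropy(Sv)."""
    return entropy(examples) - remaining_entropy(examples, attr)

# 14 examples, 9 YES / 5 NO; Wind=Weak: 6 YES, 2 NO; Wind=Strong: 3 YES, 3 NO.
S = ([({'Wind': 'Weak'}, 'YES')] * 6 + [({'Wind': 'Weak'}, 'NO')] * 2 +
     [({'Wind': 'Strong'}, 'YES')] * 3 + [({'Wind': 'Strong'}, 'NO')] * 3)

print(round(entropy(S), 3))        # ~0.94, the 0.940 above
print(round(gain(S, 'Wind'), 3))   # ~0.048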
Decision Tree Construction Algorithm (pseudo-code):
Input: A data set, S
Output: A decision tree
• If all the instances have the same value for the target attribute, then return a decision tree that is simply this value (not really a tree - more of a stump).
• Else
1. Compute Gain values (see above) for all attributes, select the attribute with the highest value, and create a node for that attribute.
2. Make a branch from this node for every value of the attribute.
3. Assign the corresponding attribute value to each branch.
4. Follow each branch by partitioning the dataset down to only those instances where the branch's value is present, and then go back to 1.
(A Python sketch of this recursion follows.)
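A compact sketch of the recursion above (it reuses the gain function from the earlier sketches; the dictionary-based tree format and the majority-vote fallback are my own choices, not stated in the slides):

from collections import Counter

def id3(examples, attributes):
    decisions = [d for _, d in examples]
    if len(set(decisions)) == 1:               # all instances agree: return a "stump"
        return decisions[0]
    if not attributes:                         # no attributes left: fall back to majority vote
        return Counter(decisions).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a))    # highest information gain
    tree = {best: {}}
    for v in {attrs[best] for attrs, _ in examples}:           # one branch per value of best
        subset = [(a, d) for a, d in examples if a[best] == v]
        tree[best][v] = id3(subset, [a for a in attributes if a != best])
    return tree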
Decision Tree Example
Outlook     Temperature   Humidity   Windy   Play?
sunny       hot           high       false   No
sunny       hot           high       true    No
overcast    hot           high       false   Yes
rain        mild          high       false   Yes
rain        cool          normal     false   Yes
rain        cool          normal     true    No
overcast    cool          normal     true    Yes
sunny       mild          high       false   No
sunny       cool          normal     false   Yes
rain        mild          normal     false   Yes
sunny       mild          normal     true    Yes
overcast    mild          high       true    Yes
overcast    hot           normal     false   Yes
rain        mild          high       true    No
The resulting decision tree:

Outlook = sunny    -> Humidity:  high -> No,  normal -> Yes
Outlook = overcast -> Yes
Outlook = rain     -> Windy:     true -> No,  false -> Yes
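As a usage example (again reusing the id3 and gain sketches above, with my own encoding of the table), running ID3 on this dataset reproduces the tree just shown, with Outlook at the root:

rows = [
    ('sunny', 'hot', 'high', 'false', 'No'),       ('sunny', 'hot', 'high', 'true', 'No'),
    ('overcast', 'hot', 'high', 'false', 'Yes'),   ('rain', 'mild', 'high', 'false', 'Yes'),
    ('rain', 'cool', 'normal', 'false', 'Yes'),    ('rain', 'cool', 'normal', 'true', 'No'),
    ('overcast', 'cool', 'normal', 'true', 'Yes'), ('sunny', 'mild', 'high', 'false', 'No'),
    ('sunny', 'cool', 'normal', 'false', 'Yes'),   ('rain', 'mild', 'normal', 'false', 'Yes'),
    ('sunny', 'mild', 'normal', 'true', 'Yes'),    ('overcast', 'mild', 'high', 'true', 'Yes'),
    ('overcast', 'hot', 'normal', 'false', 'Yes'), ('rain', 'mild', 'high', 'true', 'No'),
]
attributes = ['Outlook', 'Temperature', 'Humidity', 'Windy']
weather = [(dict(zip(attributes, r[:4])), r[4]) for r in rows]
print(id3(weather, attributes))
# {'Outlook': {'overcast': 'Yes',
#              'sunny': {'Humidity': {'high': 'No', 'normal': 'Yes'}},
#              'rain':  {'Windy': {'true': 'No', 'false': 'Yes'}}}}
# (branch order in the printed dictionary may vary between runs)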
Which Attributes to Select?
A Criterion for Attribute Selection
Which is the best attribute?
The one which will
result in the smallest tree
Heuristic: choose the
attribute that produces
the “purest” nodes
[Figure: candidate attribute splits for the weather data; the overcast branch of the Outlook split is already pure (all Yes).]
• Information gain:
(information before split) – (information after split)
• Information gain for the attributes from the weather data: Outlook gives the largest gain, so it is chosen for the first split.
Continuing to Split
Within the Outlook = sunny branch:
gain("Temperature") = 0.571 bits
gain("Humidity") = 0.971 bits
gain("Windy") = 0.020 bits
The Final Decision Tree
Splitting stops when data can’t be split any further
Person    Hair Length   Weight   Age   Class
Homer     0"            250      36    M
Marge     10"           150      34    F
Bart      2"            90       10    M
Lisa      6"            78       8     F
Maggie    4"            20       1     F
Abe       1"            170      70    M
Selma     8"            160      41    F
Otto      10"           180      38    M
Krusty    6"            200      45    M
Comic     8"            290      38    ?
Entropy(S) = − (p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
(p and n are the counts of the two classes)

Entropy(4F, 5M) = −(4/9) log2(4/9) − (5/9) log2(5/9) = 0.9911

Let us try splitting on Hair Length:  Hair Length <= 5?  (yes / no)

Gain(A) = E(current set) − Σ E(all child sets), where each child set's entropy is weighted by its fraction of the examples.

Gain(Hair Length <= 5) = 0.9911 − (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911
Let us try splitting on Weight:  Weight <= 160?  (yes / no)

Gain(Weight <= 160) = 0.9911 − (5/9 * 0.7219 + 4/9 * 0) = 0.5900
Let us try splitting on Age:  Age <= 40?  (yes / no)

Gain(Age <= 40) = 0.9911 − (6/9 * 1 + 3/9 * 0.9183) = 0.0183
Of the 3 features we had, Weight
was best. But while people who
weigh over 160 are perfectly
classified (as males), the under 160
people are not perfectly
classified… So we simply recurse!
This time we find that we
can split on Hair length, and
we are done!
Weight <= 160?
  yes -> Hair Length <= 2?
  no  -> Male

We don't need to keep the data around, just the test conditions.
How would these people be classified?

Weight <= 160?
  no  -> Male
  yes -> Hair Length <= 2?
           yes -> Male
           no  -> Female
It is trivial to convert Decision Trees to rules…

Weight <= 160?
  no  -> Male
  yes -> Hair Length <= 2?
           yes -> Male
           no  -> Female

Rules to Classify Males/Females
If Weight greater than 160, classify as Male
Elseif Hair Length less than or equal to 2, classify as Male
Else classify as Female
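Written as code, these rules form a tiny classifier (a sketch with assumed parameter names, following the tree literally):

def classify(weight, hair_length):
    if weight > 160:
        return 'Male'
    elif hair_length <= 2:
        return 'Male'
    else:
        return 'Female'

print(classify(weight=290, hair_length=8))   # the unlabeled "Comic" row -> 'Male'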
Once we have learned the decision tree, we don’t even need a computer!
This decision tree is attached to a medical machine, and is designed to help
nurses make decisions about what type of doctor to call.
Decision tree for a typical shared-care setting applying
the system for the diagnosis of prostatic obstructions.
The worked examples we have
seen were performed on small
datasets. However with small
datasets there is a great danger of
overfitting the data…
When you have few datapoints,
there are many possible splitting
rules that perfectly classify the
data, but will not generalize to
future datasets.
Wears green?
  yes -> Female
  no  -> Male
For example, the rule "Wears green?" perfectly classifies the data, and so does "Mother's name is Jacqueline?", and so does "Has blue shoes"…