CS4445 B12 Decision Trees Homework 2 solutions by Ken Loomis

CS4445/B12
Provided by: Kenneth J. Loomis
| genre | critics-reviews | rating | IMAX | likes |
|---|---|---|---|---|
| comedy | thumbs-up | R | FALSE | no |
| comedy | thumbs-up | R | TRUE | no |
| comedy | neutral | R | FALSE | no |
| action | thumbs-down | PG-13 | TRUE | no |
| action | neutral | R | TRUE | no |
| comedy | thumbs-down | PG-13 | FALSE | yes |
| comedy | neutral | PG-13 | TRUE | yes |
| drama | thumbs-up | R | FALSE | yes |
| drama | thumbs-down | PG-13 | TRUE | yes |
| drama | neutral | R | TRUE | yes |
| drama | thumbs-up | PG-13 | FALSE | yes |
| action | neutral | R | FALSE | yes |
| action | thumbs-down | PG-13 | FALSE | yes |
| action | neutral | PG-13 | FALSE | yes |
$$
\text{Entropy}(\text{target attribute}) = \frac{14}{14}\,\text{Entropy}[5,9]
= \frac{14}{14}\left(-\frac{5}{14}\log_2\frac{5}{14} - \frac{9}{14}\log_2\frac{9}{14}\right) = 0.9403
$$
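The calculation above can be checked with a few lines of Python. This is a minimal sketch, not part of the original solution: the function name `entropy` is an illustrative choice, and the counts are read as [no, yes], that is, 5 instances with likes = no and 9 with likes = yes.

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    # Classes with a zero count are skipped: 0 * log2(0) is treated as 0.
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

# 5 "no" and 9 "yes" instances in the full training set.
target_entropy = entropy([5, 9])
print(round(target_entropy, 4))  # 0.9403
```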
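Candidate split: genre (= comedy, = drama, = action)

| genre | critics-reviews | rating | IMAX | likes |
|---|---|---|---|---|
| action | thumbs-down | PG-13 | TRUE | no |
| action | neutral | R | TRUE | no |
| action | neutral | R | FALSE | yes |
| action | thumbs-down | PG-13 | FALSE | yes |
| action | neutral | PG-13 | FALSE | yes |
| comedy | thumbs-up | R | FALSE | no |
| comedy | thumbs-up | R | TRUE | no |
| comedy | neutral | R | FALSE | no |
| comedy | thumbs-down | PG-13 | FALSE | yes |
| comedy | neutral | PG-13 | TRUE | yes |
| drama | thumbs-up | R | FALSE | yes |
| drama | thumbs-down | PG-13 | TRUE | yes |
| drama | neutral | R | TRUE | yes |
| drama | thumbs-up | PG-13 | FALSE | yes |

$$
\begin{aligned}
\text{Entropy}(\text{genre}) &= \frac{5}{14}\,\text{Entropy}[2,3] + \frac{5}{14}\,\text{Entropy}[3,2] + \frac{4}{14}\,\text{Entropy}[0,4] \\
&= \frac{5}{14}\left(-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right)
 + \frac{5}{14}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right)
 + \frac{4}{14}\left(-\frac{0}{4}\log_2\frac{0}{4} - \frac{4}{4}\log_2\frac{4}{4}\right) \\
&= 0.6935
\end{aligned}
$$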
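The weighted sum used for genre can be sketched in Python as follows. The helper name `split_entropy` is hypothetical, and each branch is assumed to be a [no, yes] count pair listed in the order action, comedy, drama.

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def split_entropy(branches):
    """Weighted average entropy over the branches of a split.

    Each branch is a [no, yes] count pair; its weight is its share of
    all instances reaching the node.
    """
    n = sum(sum(branch) for branch in branches)
    return sum(sum(branch) / n * entropy(branch) for branch in branches)

genre_branches = [[2, 3], [3, 2], [0, 4]]  # action, comedy, drama
print(round(split_entropy(genre_branches), 4))  # 0.6935
```

The same helper should apply unchanged to the other attributes once their branch counts are read off the sorted tables.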
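Candidate split: critics-reviews (= neutral, = thumbs-down, = thumbs-up)

| genre | critics-reviews | rating | IMAX | likes |
|---|---|---|---|---|
| action | neutral | R | TRUE | no |
| comedy | neutral | R | FALSE | no |
| action | neutral | R | FALSE | yes |
| action | neutral | PG-13 | FALSE | yes |
| comedy | neutral | PG-13 | TRUE | yes |
| drama | neutral | R | TRUE | yes |
| action | thumbs-down | PG-13 | TRUE | no |
| action | thumbs-down | PG-13 | FALSE | yes |
| comedy | thumbs-down | PG-13 | FALSE | yes |
| drama | thumbs-down | PG-13 | TRUE | yes |
| comedy | thumbs-up | R | FALSE | no |
| comedy | thumbs-up | R | TRUE | no |
| drama | thumbs-up | R | FALSE | yes |
| drama | thumbs-up | PG-13 | FALSE | yes |

$$
\begin{aligned}
\text{Entropy}(\text{critics-reviews}) &= \frac{6}{14}\,\text{Entropy}[2,4] + \frac{4}{14}\,\text{Entropy}[1,3] + \frac{4}{14}\,\text{Entropy}[2,2] \\
&= \frac{6}{14}\left(-\frac{2}{6}\log_2\frac{2}{6} - \frac{4}{6}\log_2\frac{4}{6}\right)
 + \frac{4}{14}\left(-\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4}\right)
 + \frac{4}{14}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right) \\
&= 0.9111
\end{aligned}
$$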
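Candidate split: rating (= PG-13, = R)

| genre | critics-reviews | rating | IMAX | likes |
|---|---|---|---|---|
| action | thumbs-down | PG-13 | TRUE | no |
| action | neutral | PG-13 | FALSE | yes |
| comedy | neutral | PG-13 | TRUE | yes |
| action | thumbs-down | PG-13 | FALSE | yes |
| comedy | thumbs-down | PG-13 | FALSE | yes |
| drama | thumbs-down | PG-13 | TRUE | yes |
| drama | thumbs-up | PG-13 | FALSE | yes |
| action | neutral | R | TRUE | no |
| comedy | neutral | R | FALSE | no |
| comedy | thumbs-up | R | FALSE | no |
| comedy | thumbs-up | R | TRUE | no |
| action | neutral | R | FALSE | yes |
| drama | neutral | R | TRUE | yes |
| drama | thumbs-up | R | FALSE | yes |

$$
\begin{aligned}
\text{Entropy}(\text{rating}) &= \frac{7}{14}\,\text{Entropy}[1,6] + \frac{7}{14}\,\text{Entropy}[4,3] \\
&= \frac{7}{14}\left(-\frac{1}{7}\log_2\frac{1}{7} - \frac{6}{7}\log_2\frac{6}{7}\right)
 + \frac{7}{14}\left(-\frac{4}{7}\log_2\frac{4}{7} - \frac{3}{7}\log_2\frac{3}{7}\right) \\
&= 0.7885
\end{aligned}
$$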
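Candidate split: IMAX (= TRUE, = FALSE)

| genre | critics-reviews | rating | IMAX | likes |
|---|---|---|---|---|
| comedy | neutral | R | FALSE | no |
| comedy | thumbs-up | R | FALSE | no |
| action | neutral | PG-13 | FALSE | yes |
| action | thumbs-down | PG-13 | FALSE | yes |
| comedy | thumbs-down | PG-13 | FALSE | yes |
| drama | thumbs-up | PG-13 | FALSE | yes |
| action | neutral | R | FALSE | yes |
| drama | thumbs-up | R | FALSE | yes |
| action | thumbs-down | PG-13 | TRUE | no |
| action | neutral | R | TRUE | no |
| comedy | thumbs-up | R | TRUE | no |
| comedy | neutral | PG-13 | TRUE | yes |
| drama | thumbs-down | PG-13 | TRUE | yes |
| drama | neutral | R | TRUE | yes |

$$
\begin{aligned}
\text{Entropy}(\text{IMAX}) &= \frac{8}{14}\,\text{Entropy}[2,6] + \frac{6}{14}\,\text{Entropy}[3,3] \\
&= \frac{8}{14}\left(-\frac{2}{8}\log_2\frac{2}{8} - \frac{6}{8}\log_2\frac{6}{8}\right)
 + \frac{6}{14}\left(-\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6}\right) \\
&= 0.8922
\end{aligned}
$$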
Entropy (genre) = 0.6935
Entropy (critics-reviews) = 0.9111
Entropy (rating) = 0.7885
Entropy (IMAX) = 0.8922

• We can see that genre provides us with the lowest entropy, thus it becomes the root node of our ID3 tree.
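The root selection can be reproduced from the raw table. The following is a sketch under stated assumptions: the names `ROWS`, `split_entropy`, and `scores` are illustrative and not from the original solution, and the 14 training instances are typed in from the table at the top of this document.

```python
from collections import Counter, defaultdict
from math import log2

COLUMNS = ["genre", "critics-reviews", "rating", "IMAX", "likes"]
DATA = """
comedy thumbs-up R FALSE no
comedy thumbs-up R TRUE no
comedy neutral R FALSE no
action thumbs-down PG-13 TRUE no
action neutral R TRUE no
comedy thumbs-down PG-13 FALSE yes
comedy neutral PG-13 TRUE yes
drama thumbs-up R FALSE yes
drama thumbs-down PG-13 TRUE yes
drama neutral R TRUE yes
drama thumbs-up PG-13 FALSE yes
action neutral R FALSE yes
action thumbs-down PG-13 FALSE yes
action neutral PG-13 FALSE yes
"""
ROWS = [dict(zip(COLUMNS, line.split())) for line in DATA.strip().splitlines()]

def entropy(counts):
    """Entropy (in bits) of a class distribution given as raw counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts)

def split_entropy(rows, attr):
    """Weighted average entropy of `likes` after splitting rows on attr."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[attr]][row["likes"]] += 1
    return sum(
        sum(g.values()) / len(rows) * entropy(g.values()) for g in groups.values()
    )

# Score every attribute and keep the one with the lowest weighted entropy.
scores = {attr: split_entropy(ROWS, attr) for attr in COLUMNS[:-1]}
root = min(scores, key=scores.get)
for attr, score in scores.items():
    print(f"{attr}: {score:.4f}")
print("root:", root)
```

Choosing the minimum weighted entropy mirrors the comparison made in the bullet above; ties, which do not occur here, would be broken by column order in this sketch.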
genre
  = comedy: ?
  = drama
  = action

We now move on to the left child node of our tree. What attribute do we choose for this node?

Options:
• critics-reviews
• rating
• IMAX
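genre
  = comedy: critics-reviews
      = neutral
      = thumbs-down
      = thumbs-up
  = drama
  = action

| genre | critics-reviews | rating | IMAX | likes |
|---|---|---|---|---|
| comedy | neutral | R | FALSE | no |
| comedy | neutral | PG-13 | TRUE | yes |
| comedy | thumbs-down | PG-13 | FALSE | yes |
| comedy | thumbs-up | R | FALSE | no |
| comedy | thumbs-up | R | TRUE | no |

$$
\begin{aligned}
\text{Entropy}(\text{critics-reviews}) &= \frac{2}{5}\,\text{Entropy}[1,1] + \frac{1}{5}\,\text{Entropy}[0,1] + \frac{2}{5}\,\text{Entropy}[2,0] \\
&= \frac{2}{5}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right)
 + \frac{1}{5}\left(-\frac{0}{1}\log_2\frac{0}{1} - \frac{1}{1}\log_2\frac{1}{1}\right)
 + \frac{2}{5}\left(-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right) \\
&= 0.4000
\end{aligned}
$$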
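genre
  = comedy: rating
      = PG-13
      = R
  = drama
  = action

| genre | critics-reviews | rating | IMAX | likes |
|---|---|---|---|---|
| comedy | neutral | PG-13 | TRUE | yes |
| comedy | thumbs-down | PG-13 | FALSE | yes |
| comedy | neutral | R | FALSE | no |
| comedy | thumbs-up | R | FALSE | no |
| comedy | thumbs-up | R | TRUE | no |

$$
\begin{aligned}
\text{Entropy}(\text{rating}) &= \frac{2}{5}\,\text{Entropy}[0,2] + \frac{3}{5}\,\text{Entropy}[3,0] \\
&= \frac{2}{5}\left(-\frac{0}{2}\log_2\frac{0}{2} - \frac{2}{2}\log_2\frac{2}{2}\right)
 + \frac{3}{5}\left(-\frac{3}{3}\log_2\frac{3}{3} - \frac{0}{3}\log_2\frac{0}{3}\right) \\
&= 0.0
\end{aligned}
$$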
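genre
  = comedy: IMAX
      = FALSE
      = TRUE
  = drama
  = action

| genre | critics-reviews | rating | IMAX | likes |
|---|---|---|---|---|
| comedy | neutral | R | FALSE | no |
| comedy | thumbs-up | R | FALSE | no |
| comedy | thumbs-down | PG-13 | FALSE | yes |
| comedy | thumbs-up | R | TRUE | no |
| comedy | neutral | PG-13 | TRUE | yes |

$$
\begin{aligned}
\text{Entropy}(\text{IMAX}) &= \frac{3}{5}\,\text{Entropy}[2,1] + \frac{2}{5}\,\text{Entropy}[1,1] \\
&= \frac{3}{5}\left(-\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3}\right)
 + \frac{2}{5}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right) \\
&= 0.9510
\end{aligned}
$$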
Entropy (critics-reviews) = 0.4000
Entropy (rating) = 0.0
Entropy (IMAX) = 0.9510

• We can see that rating provides us with the lowest entropy, thus it becomes the left child node of our ID3 tree.
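The same comparison can be run on just the genre = comedy instances. This sketch is an illustration rather than the author's code: the five comedy rows are copied from the training table, genre is dropped from the candidates because it was already used at the root, and the names `COMEDY` and `split_entropy` are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

COLUMNS = ["critics-reviews", "rating", "IMAX", "likes"]
# The five training instances with genre = comedy.
COMEDY = [dict(zip(COLUMNS, line.split())) for line in """
thumbs-up R FALSE no
thumbs-up R TRUE no
neutral R FALSE no
thumbs-down PG-13 FALSE yes
neutral PG-13 TRUE yes
""".strip().splitlines()]

def entropy(counts):
    """Entropy (in bits) of a class distribution given as raw counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts)

def split_entropy(rows, attr):
    """Weighted average entropy of `likes` after splitting rows on attr."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[attr]][row["likes"]] += 1
    return sum(
        sum(g.values()) / len(rows) * entropy(g.values()) for g in groups.values()
    )

scores = {attr: split_entropy(COMEDY, attr) for attr in COLUMNS[:-1]}
best = min(scores, key=scores.get)
print(best, {attr: round(score, 4) for attr, score in scores.items()})
```

A weighted entropy of zero means every branch of the split contains a single class, which is why the rating node can be followed directly by leaf nodes.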
genre
  = comedy: rating
      = PG-13: [yes]
      = R: [no]
  = drama
  = action

| genre | critics-reviews | rating | IMAX | likes |
|---|---|---|---|---|
| comedy | neutral | PG-13 | TRUE | yes |
| comedy | thumbs-down | PG-13 | FALSE | yes |
| comedy | neutral | R | FALSE | no |
| comedy | thumbs-up | R | FALSE | no |
| comedy | thumbs-up | R | TRUE | no |

• This also makes this split homogeneous so we can add our leaf nodes here.
genre
  = comedy: rating
      = PG-13: [yes]
      = R: [no]
  = drama: [yes]
  = action

| genre | critics-reviews | rating | IMAX | likes |
|---|---|---|---|---|
| drama | thumbs-up | R | FALSE | yes |
| drama | thumbs-down | PG-13 | TRUE | yes |
| drama | neutral | R | TRUE | yes |
| drama | thumbs-up | PG-13 | FALSE | yes |

• We can see that genre = drama provides us with a homogeneous subset, so we can provide a leaf node here.
genre
  = comedy: rating
      = PG-13: [yes]
      = R: [no]
  = drama: [yes]
  = action: ?

We now move on to the right child node of our tree. What attribute do we choose for this node?

Options:
• critics-reviews
• rating
• IMAX
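genre
  = comedy: rating
      = PG-13: [yes]
      = R: [no]
  = drama: [yes]
  = action: critics-reviews
      = neutral
      = thumbs-down
      = thumbs-up

| genre | critics-reviews | rating | IMAX | likes |
|---|---|---|---|---|
| action | neutral | R | TRUE | no |
| action | neutral | R | FALSE | yes |
| action | neutral | PG-13 | FALSE | yes |
| action | thumbs-down | PG-13 | TRUE | no |
| action | thumbs-down | PG-13 | FALSE | yes |

$$
\begin{aligned}
\text{Entropy}(\text{critics-reviews}) &= \frac{3}{5}\,\text{Entropy}[1,2] + \frac{2}{5}\,\text{Entropy}[1,1] + 0 \\
&= \frac{3}{5}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right)
 + \frac{2}{5}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right) + 0 \\
&= 0.9510
\end{aligned}
$$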
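genre
  = comedy: rating
      = PG-13: [yes]
      = R: [no]
  = drama: [yes]
  = action: rating
      = PG-13
      = R

| genre | critics-reviews | rating | IMAX | likes |
|---|---|---|---|---|
| action | thumbs-down | PG-13 | TRUE | no |
| action | neutral | PG-13 | FALSE | yes |
| action | thumbs-down | PG-13 | FALSE | yes |
| action | neutral | R | TRUE | no |
| action | neutral | R | FALSE | yes |

$$
\begin{aligned}
\text{Entropy}(\text{rating}) &= \frac{3}{5}\,\text{Entropy}[1,2] + \frac{2}{5}\,\text{Entropy}[1,1] \\
&= \frac{3}{5}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right)
 + \frac{2}{5}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right) \\
&= 0.9510
\end{aligned}
$$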
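genre
  = comedy: rating
      = PG-13: [yes]
      = R: [no]
  = drama: [yes]
  = action: IMAX
      = FALSE
      = TRUE

| genre | critics-reviews | rating | IMAX | likes |
|---|---|---|---|---|
| action | neutral | PG-13 | FALSE | yes |
| action | thumbs-down | PG-13 | FALSE | yes |
| action | neutral | R | FALSE | yes |
| action | thumbs-down | PG-13 | TRUE | no |
| action | neutral | R | TRUE | no |

$$
\begin{aligned}
\text{Entropy}(\text{IMAX}) &= \frac{3}{5}\,\text{Entropy}[0,3] + \frac{2}{5}\,\text{Entropy}[2,0] \\
&= \frac{3}{5}\left(-\frac{0}{3}\log_2\frac{0}{3} - \frac{3}{3}\log_2\frac{3}{3}\right)
 + \frac{2}{5}\left(-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right) \\
&= 0.0
\end{aligned}
$$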
Entropy (critics-reviews) = 0.9510
Entropy (rating) = 0.9510
Entropy (IMAX) = 0.0

• We can see that IMAX provides us with the lowest entropy, thus it becomes the right child node of our ID3 tree.
genre
  = comedy: rating
      = PG-13: [yes]
      = R: [no]
  = drama: [yes]
  = action: IMAX
      = FALSE: [yes]
      = TRUE: [no]

| genre | critics-reviews | rating | IMAX | likes |
|---|---|---|---|---|
| action | neutral | PG-13 | FALSE | yes |
| action | thumbs-down | PG-13 | FALSE | yes |
| action | neutral | R | FALSE | yes |
| action | thumbs-down | PG-13 | TRUE | no |
| action | neutral | R | TRUE | no |

• This also makes this split homogeneous so we can add our leaf nodes here.
genre
  = comedy: rating
      = PG-13: [yes]
      = R: [no]
  = drama: [yes]
  = action: IMAX
      = FALSE: [yes]
      = TRUE: [no]

• Since we have only leaf nodes remaining we are finished building our tree.
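The whole construction can be written as a short recursion. This is a simplified sketch of the procedure followed in these slides, not a reference ID3 implementation: it picks the attribute with the lowest weighted entropy, stops when a subset is homogeneous, and stores each leaf as a `(label, count)` tuple. All names (`id3`, `split_entropy`, `ROWS`) and the tuple encoding of the tree are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log2

COLUMNS = ["genre", "critics-reviews", "rating", "IMAX", "likes"]
ROWS = [dict(zip(COLUMNS, line.split())) for line in """
comedy thumbs-up R FALSE no
comedy thumbs-up R TRUE no
comedy neutral R FALSE no
action thumbs-down PG-13 TRUE no
action neutral R TRUE no
comedy thumbs-down PG-13 FALSE yes
comedy neutral PG-13 TRUE yes
drama thumbs-up R FALSE yes
drama thumbs-down PG-13 TRUE yes
drama neutral R TRUE yes
drama thumbs-up PG-13 FALSE yes
action neutral R FALSE yes
action thumbs-down PG-13 FALSE yes
action neutral PG-13 FALSE yes
""".strip().splitlines()]

def entropy(counts):
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts)

def split_entropy(rows, attr):
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[attr]][row["likes"]] += 1
    return sum(
        sum(g.values()) / len(rows) * entropy(g.values()) for g in groups.values()
    )

def id3(rows, attrs):
    """Return a leaf (label, count) or a node (attribute, {value: subtree})."""
    labels = Counter(row["likes"] for row in rows)
    if len(labels) == 1 or not attrs:
        return labels.most_common(1)[0]  # homogeneous subset becomes a leaf
    best = min(attrs, key=lambda a: split_entropy(rows, a))
    rest = [a for a in attrs if a != best]
    return (best, {
        value: id3([r for r in rows if r[best] == value], rest)
        for value in sorted({r[best] for r in rows})
    })

tree = id3(ROWS, COLUMNS[:-1])
print(tree)
```

The leaf counts returned here are the number of training instances reaching each leaf; branches are only created for attribute values that actually occur in the subset.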
• How can we handle missing values using this decision tree?
• Given an instance:
• Genre = action
• Critics-reviews = ?
• Rating = R
• IMAX = ?
How do we classify it?
• Consider adding frequency counts to each leaf node, shown here in curly braces.

genre
  = comedy: rating
      = PG-13: [yes] {2}
      = R: [no] {3}
  = drama: [yes] {4}
  = action: IMAX
      = FALSE: [yes] {3}
      = TRUE: [no] {2}
• Traverse the tree for the instance above (Genre = action, Critics-reviews = ?, Rating = R, IMAX = ?).
• Traverse the decision tree normally when the attribute value is known. Here genre = action leads to the IMAX node.
• Follow every possible path when a missing value is encountered. Here IMAX is missing, so both of its leaves are reached: [yes] {3} and [no] {2}.
• Determine the frequency count by summing the frequencies of all like leaf nodes that are reached:
Freq(yes) = 3
Freq(no) = 2
• Classify based on the highest frequency count: likes = yes.
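The steps above can be sketched in Python. This is an illustrative encoding, not the author's: the tree is written as nested tuples, a leaf is `(label, count)`, a missing attribute is simply absent from the instance dictionary, and the names `TREE`, `leaf_counts`, and `classify` are hypothetical. An attribute value never seen in training would raise a `KeyError` in this sketch.

```python
from collections import Counter

# Internal node: (attribute, {value: subtree}); leaf: (class label, frequency).
TREE = ("genre", {
    "comedy": ("rating", {"PG-13": ("yes", 2), "R": ("no", 3)}),
    "drama": ("yes", 4),
    "action": ("IMAX", {"FALSE": ("yes", 3), "TRUE": ("no", 2)}),
})

def leaf_counts(node, instance):
    """Sum the leaf frequencies reachable for a possibly incomplete instance."""
    head, rest = node
    if not isinstance(rest, dict):      # leaf reached
        return Counter({head: rest})
    value = instance.get(head)          # None means the value is missing
    if value is None:
        children = list(rest.values())  # missing value: follow every branch
    else:
        children = [rest[value]]        # known value: follow one branch
    total = Counter()
    for child in children:
        total += leaf_counts(child, instance)
    return total

def classify(instance):
    """Return the class with the highest summed frequency count."""
    return leaf_counts(TREE, instance).most_common(1)[0][0]

# Genre = action, Critics-reviews = ?, Rating = R, IMAX = ?
instance = {"genre": "action", "rating": "R"}
print(dict(leaf_counts(TREE, instance)), classify(instance))
```

For this instance the sketch reaches only the two IMAX leaves, giving 3 for yes and 2 for no, so it classifies the instance as yes.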
• Consider this 2nd example:
• Genre = ?
• Critics-reviews = ?
• Rating = R
• IMAX = TRUE
Genre is missing, so all three branches are followed: comedy leads to rating = R and the leaf [no] {3}, drama leads to the leaf [yes] {4}, and action leads to IMAX = TRUE and the leaf [no] {2}.
Freq(yes) = 4
Freq(no) = 3 + 2 = 5
likes = no
• Consider if all attribute values are unknown:
• Genre = ?
• Critics-reviews = ?
• Rating = ?
• IMAX = ?
Every path is followed, so every leaf is reached.
Freq(yes) = 2 + 4 + 3 = 9
Freq(no) = 3 + 2 = 5
likes = yes
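When every attribute is unknown, the summed leaf counts are simply the class counts of the training set (9 yes, 5 no), so the prediction is the majority class. A small sketch of that check follows, using the same hypothetical tuple encoding of the tree (leaf = `(label, count)`); the names `TREE` and `all_leaf_counts` are illustrative only.

```python
from collections import Counter

# Internal node: (attribute, {value: subtree}); leaf: (class label, frequency).
TREE = ("genre", {
    "comedy": ("rating", {"PG-13": ("yes", 2), "R": ("no", 3)}),
    "drama": ("yes", 4),
    "action": ("IMAX", {"FALSE": ("yes", 3), "TRUE": ("no", 2)}),
})

def all_leaf_counts(node):
    """Sum the frequency counts of every leaf in the tree, per class."""
    head, rest = node
    if not isinstance(rest, dict):  # leaf reached
        return Counter({head: rest})
    total = Counter()
    for child in rest.values():
        total += all_leaf_counts(child)
    return total

totals = all_leaf_counts(TREE)
print(dict(totals))                 # per-class totals over all leaves
print(totals.most_common(1)[0][0])  # majority class
```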