CS4445/B12
Provided by: Kenneth J. Loomis

Training data (target attribute: likes):

| genre | critics-reviews | rating | IMAX | likes |
|---|---|---|---|---|
| comedy | thumbs-up | R | FALSE | no |
| comedy | thumbs-up | R | TRUE | no |
| comedy | neutral | R | FALSE | no |
| action | thumbs-down | PG-13 | TRUE | no |
| action | neutral | R | TRUE | no |
| comedy | thumbs-down | PG-13 | FALSE | yes |
| comedy | neutral | PG-13 | TRUE | yes |
| drama | thumbs-up | R | FALSE | yes |
| drama | thumbs-down | PG-13 | TRUE | yes |
| drama | neutral | R | TRUE | yes |
| drama | thumbs-up | PG-13 | FALSE | yes |
| action | neutral | R | FALSE | yes |
| action | thumbs-down | PG-13 | FALSE | yes |
| action | neutral | PG-13 | FALSE | yes |

Entropy (target attribute), with class counts written as [no, yes]:

$$
\begin{aligned}
\text{Entropy}(\text{likes}) &= \frac{14}{14}\,\text{Entropy}[5,9] \\
&= \frac{14}{14}\left(-\frac{5}{14}\log_2\frac{5}{14} - \frac{9}{14}\log_2\frac{9}{14}\right) \\
&= 0.9403
\end{aligned}
$$

Entropy (genre):

| genre | no | yes |
|---|---|---|
| action | 2 | 3 |
| comedy | 3 | 2 |
| drama | 0 | 4 |

$$
\begin{aligned}
\text{Entropy}(\text{genre}) &= \frac{5}{14}\,\text{Entropy}[2,3] + \frac{5}{14}\,\text{Entropy}[3,2] + \frac{4}{14}\,\text{Entropy}[0,4] \\
&= \frac{5}{14}\left(-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right)
 + \frac{5}{14}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right)
 + \frac{4}{14}\left(-\frac{0}{4}\log_2\frac{0}{4} - \frac{4}{4}\log_2\frac{4}{4}\right) \\
&= 0.6935
\end{aligned}
$$

Entropy (critics-reviews):

| critics-reviews | no | yes |
|---|---|---|
| neutral | 2 | 4 |
| thumbs-down | 1 | 3 |
| thumbs-up | 2 | 2 |

$$
\begin{aligned}
\text{Entropy}(\text{critics-reviews}) &= \frac{6}{14}\,\text{Entropy}[2,4] + \frac{4}{14}\,\text{Entropy}[1,3] + \frac{4}{14}\,\text{Entropy}[2,2] \\
&= \frac{6}{14}\left(-\frac{2}{6}\log_2\frac{2}{6} - \frac{4}{6}\log_2\frac{4}{6}\right)
 + \frac{4}{14}\left(-\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4}\right)
 + \frac{4}{14}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right) \\
&= 0.9111
\end{aligned}
$$
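The entropy values worked out so far (target attribute, genre, critics-reviews) can be checked with a few lines of Python. This is only a sketch: the helper names `entropy` and `split_entropy` are made up for illustration and do not come from the slides, and zero counts are skipped on the usual convention that 0 · log2(0) = 0.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a list of class counts, e.g. [no, yes]."""
    total = sum(counts)
    # Zero counts are skipped (convention: 0 * log2(0) = 0).
    return sum(c / total * log2(total / c) for c in counts if c > 0)


def split_entropy(branches):
    """Weighted average entropy over the branches of a split.

    `branches` holds one [no, yes] count pair per attribute value.
    """
    n = sum(sum(b) for b in branches)
    return sum(sum(b) / n * entropy(b) for b in branches)


target = entropy([5, 9])
genre = split_entropy([[2, 3], [3, 2], [0, 4]])
critics = split_entropy([[2, 4], [1, 3], [2, 2]])
print(f"target={target:.4f} genre={genre:.4f} critics-reviews={critics:.4f}")
```

The same two helpers apply unchanged to any other split, since each split is described only by its per-branch class counts.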
Entropy (rating):

| rating | no | yes |
|---|---|---|
| PG-13 | 1 | 6 |
| R | 4 | 3 |

$$
\begin{aligned}
\text{Entropy}(\text{rating}) &= \frac{7}{14}\,\text{Entropy}[1,6] + \frac{7}{14}\,\text{Entropy}[4,3] \\
&= \frac{7}{14}\left(-\frac{1}{7}\log_2\frac{1}{7} - \frac{6}{7}\log_2\frac{6}{7}\right)
 + \frac{7}{14}\left(-\frac{4}{7}\log_2\frac{4}{7} - \frac{3}{7}\log_2\frac{3}{7}\right) \\
&= 0.7885
\end{aligned}
$$

Entropy (IMAX):

| IMAX | no | yes |
|---|---|---|
| FALSE | 2 | 6 |
| TRUE | 3 | 3 |

$$
\begin{aligned}
\text{Entropy}(\text{IMAX}) &= \frac{8}{14}\,\text{Entropy}[2,6] + \frac{6}{14}\,\text{Entropy}[3,3] \\
&= \frac{8}{14}\left(-\frac{2}{8}\log_2\frac{2}{8} - \frac{6}{8}\log_2\frac{6}{8}\right)
 + \frac{6}{14}\left(-\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6}\right) \\
&= 0.8922
\end{aligned}
$$

Entropy (genre) = 0.6935
Entropy (critics-reviews) = 0.9111
Entropy (rating) = 0.7885
Entropy (IMAX) = 0.8922

• We can see that genre provides us with the lowest entropy, thus it becomes the root node of our ID3 tree.

genre
    = comedy: ?
    = drama
    = action

We now move on to the left child node of our tree. What attribute do we choose for this node?
Options for the left child node (genre = comedy): critics-reviews, rating, IMAX

Instances with genre = comedy:

| genre | critics-reviews | rating | IMAX | likes |
|---|---|---|---|---|
| comedy | neutral | R | FALSE | no |
| comedy | neutral | PG-13 | TRUE | yes |
| comedy | thumbs-down | PG-13 | FALSE | yes |
| comedy | thumbs-up | R | FALSE | no |
| comedy | thumbs-up | R | TRUE | no |

Entropy (critics-reviews), with branches neutral [1,1], thumbs-down [0,1], thumbs-up [2,0]:

$$
\begin{aligned}
\text{Entropy}(\text{critics-reviews}) &= \frac{2}{5}\,\text{Entropy}[1,1] + \frac{1}{5}\,\text{Entropy}[0,1] + \frac{2}{5}\,\text{Entropy}[2,0] \\
&= \frac{2}{5}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right)
 + \frac{1}{5}\left(-\frac{0}{1}\log_2\frac{0}{1} - \frac{1}{1}\log_2\frac{1}{1}\right)
 + \frac{2}{5}\left(-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right) \\
&= 0.4000
\end{aligned}
$$

Entropy (rating), with branches PG-13 [0,2], R [3,0]:

$$
\begin{aligned}
\text{Entropy}(\text{rating}) &= \frac{2}{5}\,\text{Entropy}[0,2] + \frac{3}{5}\,\text{Entropy}[3,0] \\
&= \frac{2}{5}\left(-\frac{0}{2}\log_2\frac{0}{2} - \frac{2}{2}\log_2\frac{2}{2}\right)
 + \frac{3}{5}\left(-\frac{3}{3}\log_2\frac{3}{3} - \frac{0}{3}\log_2\frac{0}{3}\right) \\
&= 0.0
\end{aligned}
$$

Entropy (IMAX), with branches FALSE [2,1], TRUE [1,1]:

$$
\begin{aligned}
\text{Entropy}(\text{IMAX}) &= \frac{3}{5}\,\text{Entropy}[2,1] + \frac{2}{5}\,\text{Entropy}[1,1] \\
&= \frac{3}{5}\left(-\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3}\right)
 + \frac{2}{5}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right) \\
&= 0.9510
\end{aligned}
$$

Entropy (critics-reviews) = 0.4000
Entropy (rating) = 0.0
Entropy (IMAX) = 0.9510

• We can see that rating provides us with the lowest entropy, thus it becomes the left child node of our ID3 tree.

• This also makes this split homogeneous (both PG-13 comedies are yes, all three R comedies are no), so we can add our leaf nodes here.

genre
    = comedy: rating
        = PG-13: [yes]
        = R: [no]
    = drama
    = action
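The selection step just carried out for the genre = comedy node can be sketched in Python as follows. The function and variable names (`entropy`, `split_entropy`, `COMEDY`, `ATTRS`) are illustrative assumptions, not part of the slides; the rows are the five comedy instances from the training data.

```python
from collections import Counter, defaultdict
from math import log2

# The five genre = comedy rows as (critics-reviews, rating, IMAX, likes).
ATTRS = ["critics-reviews", "rating", "IMAX"]
COMEDY = [
    ("neutral", "R", "FALSE", "no"),
    ("neutral", "PG-13", "TRUE", "yes"),
    ("thumbs-down", "PG-13", "FALSE", "yes"),
    ("thumbs-up", "R", "FALSE", "no"),
    ("thumbs-up", "R", "TRUE", "no"),
]


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(c / n * log2(n / c) for c in Counter(labels).values())


def split_entropy(rows, col):
    """Weighted entropy of the class label after splitting on column `col`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])  # last column is the class label
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


scores = {name: split_entropy(COMEDY, i) for i, name in enumerate(ATTRS)}
best = min(scores, key=scores.get)  # lowest weighted entropy wins
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
print("split on:", best)
```

Choosing the attribute with the lowest weighted entropy is equivalent to choosing the one with the highest information gain, because the entropy of the node before the split is the same for every candidate.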
Instances with genre = drama:

| genre | critics-reviews | rating | IMAX | likes |
|---|---|---|---|---|
| drama | thumbs-up | R | FALSE | yes |
| drama | thumbs-down | PG-13 | TRUE | yes |
| drama | neutral | R | TRUE | yes |
| drama | thumbs-up | PG-13 | FALSE | yes |

• We can see that genre = drama provides us with a homogeneous subset, so we can provide a leaf node here.

genre
    = comedy: rating
        = PG-13: [yes]
        = R: [no]
    = drama: [yes]
    = action: ?

We now move on to the right child node of our tree. What attribute do we choose for this node?

Options for the right child node (genre = action): critics-reviews, rating, IMAX

Instances with genre = action:

| genre | critics-reviews | rating | IMAX | likes |
|---|---|---|---|---|
| action | neutral | R | TRUE | no |
| action | neutral | R | FALSE | yes |
| action | neutral | PG-13 | FALSE | yes |
| action | thumbs-down | PG-13 | TRUE | no |
| action | thumbs-down | PG-13 | FALSE | yes |

Entropy (critics-reviews), with branches neutral [1,2], thumbs-down [1,1]; no action instance is thumbs-up, so that branch contributes 0:

$$
\begin{aligned}
\text{Entropy}(\text{critics-reviews}) &= \frac{3}{5}\,\text{Entropy}[1,2] + \frac{2}{5}\,\text{Entropy}[1,1] + 0 \\
&= \frac{3}{5}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right)
 + \frac{2}{5}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right) + 0 \\
&= 0.9510
\end{aligned}
$$

Entropy (rating), with branches PG-13 [1,2], R [1,1]:

$$
\begin{aligned}
\text{Entropy}(\text{rating}) &= \frac{3}{5}\,\text{Entropy}[1,2] + \frac{2}{5}\,\text{Entropy}[1,1] \\
&= \frac{3}{5}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right)
 + \frac{2}{5}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right) \\
&= 0.9510
\end{aligned}
$$

Entropy (IMAX), with branches FALSE [0,3], TRUE [2,0]:

$$
\begin{aligned}
\text{Entropy}(\text{IMAX}) &= \frac{3}{5}\,\text{Entropy}[0,3] + \frac{2}{5}\,\text{Entropy}[2,0] \\
&= \frac{3}{5}\left(-\frac{0}{3}\log_2\frac{0}{3} - \frac{3}{3}\log_2\frac{3}{3}\right)
 + \frac{2}{5}\left(-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right) \\
&= 0.0
\end{aligned}
$$

Entropy (critics-reviews) = 0.9510
Entropy (rating) = 0.9510
Entropy (IMAX) = 0.0

• We can see that IMAX provides us with the lowest entropy, thus it becomes the right child node of our ID3 tree.

• This also makes this split homogeneous (all three IMAX = FALSE action movies are yes, both IMAX = TRUE ones are no), so we can add our leaf nodes here.

• Since we have only leaf nodes remaining we are finished building our tree.

genre
    = comedy: rating
        = PG-13: [yes]
        = R: [no]
    = drama: [yes]
    = action: IMAX
        = FALSE: [yes]
        = TRUE: [no]

• How can we handle missing values using this decision tree?

• Given an instance:
    • Genre = action
    • Critics-reviews = ?
    • Rating = R
    • IMAX = ?
  How do we classify it?

• Consider adding frequency counts to each leaf node, shown here in curly braces:

genre
    = comedy: rating
        = PG-13: [yes] {2}
        = R: [no] {3}
    = drama: [yes] {4}
    = action: IMAX
        = FALSE: [yes] {3}
        = TRUE: [no] {2}

• Traverse the tree.
• Traverse the decision tree normally when the attribute value is known.
• Traverse every possible path when a missing value is encountered.
• Sum the frequency counts of all like leaf nodes that are reached. For this instance, both IMAX leaves under genre = action are reached:
  $\text{Freq}_{\text{yes}} = 3$, $\text{Freq}_{\text{no}} = 2$

• Classify based on the highest frequency count: likes = yes

• Consider this 2nd example:
    • Genre = ?
    • Critics-reviews = ?
    • Rating = R
    • IMAX = TRUE
  The leaves reached are comedy / R [no] {3}, drama [yes] {4}, and action / TRUE [no] {2}:
  $\text{Freq}_{\text{yes}} = 4$, $\text{Freq}_{\text{no}} = 3 + 2 = 5$
  so likes = no

• Consider if all attribute values are unknown:
    • Genre = ?
    • Critics-reviews = ?
    • Rating = ?
    • IMAX = ?
  Every leaf is reached:
  $\text{Freq}_{\text{yes}} = 2 + 4 + 3 = 9$, $\text{Freq}_{\text{no}} = 3 + 2 = 5$
  so likes = yes
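The missing-value procedure above can be sketched in Python. This is a minimal illustration under stated assumptions: the nested-tuple encoding of the tree, the names `TREE`, `leaf_counts` and `classify`, and the use of an absent dictionary key to mean "value unknown" are all choices made here, not taken from the slides. The slides also do not say how to break a tie between class frequencies, so this sketch leaves that unspecified.

```python
from collections import Counter

# The finished tree with leaf frequency counts.
# Internal node: (attribute, {value: subtree}); leaf: (class label, count).
TREE = ("genre", {
    "comedy": ("rating", {"PG-13": ("yes", 2), "R": ("no", 3)}),
    "drama": ("yes", 4),
    "action": ("IMAX", {"FALSE": ("yes", 3), "TRUE": ("no", 2)}),
})


def leaf_counts(node, instance):
    """Sum leaf counts per class over every path the instance can follow."""
    first, second = node
    if not isinstance(second, dict):   # leaf: (label, count)
        return Counter({first: second})
    value = instance.get(first)        # None means the value is missing
    # Known value: follow one branch. Missing value: follow all branches.
    branches = second.values() if value is None else [second[value]]
    totals = Counter()
    for child in branches:
        totals += leaf_counts(child, instance)
    return totals


def classify(instance):
    """Return (predicted class, per-class frequency counts)."""
    counts = leaf_counts(TREE, instance)
    return counts.most_common(1)[0][0], dict(counts)


print(classify({"genre": "action", "rating": "R"}))  # IMAX unknown
print(classify({"rating": "R", "IMAX": "TRUE"}))     # genre unknown
print(classify({}))                                  # everything unknown
```

Critics-reviews never appears in the tree, so whether it is known or missing has no effect on the result. A known value that has no branch in the tree would raise a `KeyError` in this sketch; the slides do not cover that case.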