Data Mining
CS 341, Spring 2007
Lecture 8: Decision tree algorithms
© Prentice Hall

Jackknife Estimator: Example 1
- Estimate of the mean for X = {x1, x2, x3}, n = 3, g = 3, m = 1, θ = µ = (x1 + x2 + x3)/3
- θ1 = (x2 + x3)/2, θ2 = (x1 + x3)/2, θ3 = (x1 + x2)/2
- θ̄ = (θ1 + θ2 + θ3)/3
- θQ = gθ - (g-1)θ̄ = 3θ - 2θ̄ = (x1 + x2 + x3)/3
- In this case, the Jackknife Estimator is the same as the usual estimator.

Jackknife Estimator: Example 2
- In general, apply the Jackknife technique to the biased estimator of the variance:
  σ^2 = Σ (xi - x̄)^2 / n
- Then the jackknife estimator is
  s^2 = Σ (xi - x̄)^2 / (n - 1),
  which is known to be unbiased for σ^2.

Jackknife Estimator: Example 2 (cont'd)
- Estimate of the variance for X = {1, 4, 4}, n = 3, g = 3, m = 1, θ = σ^2
- σ^2 = ((1-3)^2 + (4-3)^2 + (4-3)^2)/3 = 2
- θ1 = ((4-4)^2 + (4-4)^2)/2 = 0, θ2 = 2.25, θ3 = 2.25
- θ̄ = (θ1 + θ2 + θ3)/3 = 1.5
- θQ = gθ - (g-1)θ̄ = 3θ - 2θ̄ = 3(2) - 2(1.5) = 3
- In this case, the Jackknife Estimator is different from the usual estimator.

Review: Distance-based Algorithms
- Place items in the class to which they are "closest".
- Similarity measures or distance measures
- Simple approach
- K Nearest Neighbors
- Decision Tree issues, pros and cons

Classification Using Decision Trees
- Partitioning based: divide the search space into rectangular regions.
- A tuple is placed into a class based on the region within which it falls.
- DT approaches differ in how the tree is built: DT Induction.
- Internal nodes are associated with attributes, and arcs with values for that attribute.
- Algorithms: ID3, C4.5, CART

Decision Tree
- Given:
  – D = {t1, …, tn} where ti = <ti1, …, tih>
  – Database schema contains {A1, A2, …, Ah}
  – Classes C = {C1, …, Cm}
- A Decision or Classification Tree is a tree associated with D such that:
  – Each internal node is labeled with an attribute, Ai
  – Each arc is labeled with a predicate that can be applied to the attribute at its parent
  – Each leaf node is labeled with a class, Cj

Information
- Decision Tree Induction is often based on Information Theory.

DT Induction
- When all the marbles in the bowl are mixed up, little information is given.
- When the marbles in the bowl are all from one class and those in the other two classes are on either side, more information is given.

Information/Entropy
- Given probabilities p1, p2, …, ps whose sum is 1, Entropy is defined as
  H(p1, p2, …, ps) = Σi pi log(1/pi)
- Entropy measures the amount of randomness or surprise or uncertainty.
- Its value is between 0 and 1, and it reaches its maximum when all the probabilities are the same.
- Goal in classification: no surprise, i.e. entropy = 0.
- Use this approach with DT Induction!

Entropy
- Graph of H(p, 1-p) as a function of p; labels on the slide: H(p,1-p), log(1/p).

ID3
- Creates a tree using information theory concepts and tries to reduce the expected number of comparisons.
- ID3 chooses the split attribute with the highest information gain:
  – Information gain: the difference between how much information is needed to make a correct classification before the split versus how much information is needed after the split.
  – Gain(D, S) = H(D) - Σi P(Di) H(Di)
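To make the entropy and information-gain calculations concrete before the worked example below, here is a minimal Python sketch. It is not from the original lecture: the helper names entropy and info_gain and the four-label toy example are illustrative, and base-10 logarithms are used only because they match the numbers on the following slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (base-10 logs, matching the lecture's numbers)."""
    n = len(labels)
    return sum((c / n) * math.log10(n / c) for c in Counter(labels).values())

def info_gain(labels, groups):
    """Information gain = entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    expected = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - expected

# Tiny made-up illustration: a split that separates the classes perfectly
# recovers all of the starting entropy as gain.
labels = ["Short", "Short", "Tall", "Tall"]
split = [["Short", "Short"], ["Tall", "Tall"]]
print(round(entropy(labels), 4))           # 0.301 (= log10 2)
print(round(info_gain(labels, split), 4))  # 0.301: no uncertainty left after the split
```

Applied to the Output1 column of the height data below, the same two helpers reproduce the slide's figures: a starting entropy of about 0.4384 and a gain of about 0.0969 for the gender split.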
Height Example Data

Name       Gender  Height  Output1  Output2
Kristina   F       1.6 m   Short    Medium
Jim        M       2.0 m   Tall     Medium
Maggie     F       1.9 m   Medium   Tall
Martha     F       1.88 m  Medium   Tall
Stephanie  F       1.7 m   Short    Medium
Bob        M       1.85 m  Medium   Medium
Kathy      F       1.6 m   Short    Medium
Dave       M       1.7 m   Short    Medium
Worth      M       2.2 m   Tall     Tall
Steven     M       2.1 m   Tall     Tall
Debbie     F       1.8 m   Medium   Medium
Todd       M       1.95 m  Medium   Medium
Kim        F       1.9 m   Medium   Tall
Amy        F       1.8 m   Medium   Medium
Wynette    F       1.75 m  Medium   Medium

Information Gain
- Choose gender as the split attribute:
  – H(D): entropy before the split
  – E(H(D)): expected entropy after the split
  – Information gain = H(D) - E(H(D))
- Choose height as the split attribute:
  – H(D): entropy before the split
  – E(H(D)): expected entropy after the split
  – Information gain = H(D) - E(H(D))

ID3 Example (Output1)
- Starting state entropy:
  4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
- Gain using gender:
  – Female: 3/9 log(9/3) + 6/9 log(9/6) = 0.2764
  – Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
  – Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.34152
  – Gain: 0.4384 - 0.34152 = 0.09688
- Gain using height: 0.4384 - (2/15)(0.301) = 0.3983
- Choose height as the first splitting attribute.

C4.5
- ID3 favors attributes with a large number of divisions.
- Improved version of ID3:
  – Missing data
  – Continuous data
  – Pruning
  – Rules
  – GainRatio: GainRatio(D, S) = Gain(D, S) / H(|D1|/|D|, …, |Ds|/|D|)

C4.5: Example
- Calculate the GainRatio for the gender split:
  – Entropy associated with the split, ignoring classes: H(9/15, 6/15) = 0.292
  – GainRatio value for the gender attribute: 0.09688/0.292 = 0.332

C5.0
- A commercial version of C4.5, widely used in many data mining packages.
- Targeted toward use with large datasets:
  – Produces more accurate rules.
  – Improves on memory usage by 90%.
  – Runs much faster than C4.5.

CART
- Creates a binary tree.
- Uses entropy for the best splitting attribute (as with ID3).
- Formula to choose the split point, s, for node t:
  ϕ(s/t) = 2 PL PR Σj |P(Cj|tL) - P(Cj|tR)|
- PL, PR: probability that a tuple in the training set will be on the left or right side of the tree.
- P(Cj|tL), P(Cj|tR): probability that a tuple is in class Cj and in the left (or right) subtree.

CART Example
- At the start, there are six choices for split point (right branch on equality):
  – ϕ(Gender) = 2(6/15)(9/15)(2/15 + 4/15 + 3/15) = 0.224
  – ϕ(1.6) = 0
  – ϕ(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169
  – ϕ(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385
  – ϕ(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256
  – ϕ(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32
- Best split at 1.8.
- A short code sketch of this computation appears at the end of these notes.

Scalable DT Techniques
- SPRINT
  – Creation of DTs for large datasets.
  – Based on CART techniques.

What is next?

Next Lecture:
- Rule-based algorithms
- Combining techniques
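As noted on the CART Example slide, here is a minimal Python sketch of the ϕ split measure for the height attribute. It is an illustration rather than part of the lecture: reading P(Cj|tL) and P(Cj|tR) as class counts divided by the total number of tuples is an assumption (one that reproduces most of the ϕ values above; exact figures can vary with the convention), and the function and variable names are made up.

```python
# Minimal sketch of the CART split measure from the "CART" slide:
#   phi(s) = 2 * P_L * P_R * sum_j |P(Cj|tL) - P(Cj|tR)|
# Assumption: P(Cj|t_side) is taken as (count of class Cj on that side) / (total tuples).
# The data is the Output1 labelling of the Height Example Data; names are illustrative.

data = [  # (height in metres, Output1 class)
    (1.6, "Short"), (2.0, "Tall"), (1.9, "Medium"), (1.88, "Medium"),
    (1.7, "Short"), (1.85, "Medium"), (1.6, "Short"), (1.7, "Short"),
    (2.2, "Tall"), (2.1, "Tall"), (1.8, "Medium"), (1.95, "Medium"),
    (1.9, "Medium"), (1.8, "Medium"), (1.75, "Medium"),
]
classes = ["Short", "Medium", "Tall"]

def phi(split_point):
    """Goodness of the binary split height < split_point vs height >= split_point."""
    n = len(data)
    left = [c for h, c in data if h < split_point]    # right branch on equality
    right = [c for h, c in data if h >= split_point]
    p_l, p_r = len(left) / n, len(right) / n
    diff = sum(abs(left.count(c) / n - right.count(c) / n) for c in classes)
    return 2 * p_l * p_r * diff

for s in [1.6, 1.7, 1.8, 1.9, 2.0]:
    print(s, round(phi(s), 3))   # phi(1.8) is the largest of the height splits
```

Running the loop prints the five height split points with ϕ(1.8) as the largest value, in line with the slide's conclusion that the best split is at 1.8.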