Tutorial 2

SEEM4630 2013-2014 Tutorial 2

Classification: Definition

• Given a collection of records (the training set), each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
  • Decision tree
  • Naïve Bayes
  • k-NN
• Goal: previously unseen records should be assigned a class as accurately as possible.

Decision Tree

• Goal: construct a tree so that instances belonging to different classes are separated.
• Basic algorithm (a greedy algorithm):
  • The tree is constructed in a top-down recursive manner.
  • At the start, all the training examples are at the root.
  • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
  • Examples are partitioned recursively based on the selected attributes.

Attribute Selection Measure 1: Information Gain

• Let $p_i$ be the probability that a tuple in $D$ belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.
• Expected information (entropy) needed to classify a tuple in $D$, where $m$ is the number of classes:
  $$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
• Information needed (after using $A$ to split $D$ into $v$ partitions) to classify $D$:
  $$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$$
• Information gained by branching on attribute $A$:
  $$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$

Attribute Selection Measure 2: Gain Ratio

• The information gain measure is biased towards attributes with a large number of values.
• C4.5 (a successor of ID3) uses the gain ratio to overcome this problem (a normalization of information gain):
  $$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$$
  $$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}$$

Attribute Selection Measure 3: Gini Index

• If a data set $D$ contains examples from $n$ classes, the Gini index $\mathrm{gini}(D)$ is defined as
  $$\mathrm{gini}(D) = 1 - \sum_{j=1}^{n} p_j^2$$
  where $p_j$ is the relative frequency of class $j$ in $D$.
• If a data set $D$ is split on $A$ into two subsets $D_1$ and $D_2$, the Gini index of the split, $\mathrm{gini}_A(D)$, is defined as
  $$\mathrm{gini}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{gini}(D_2)$$
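The three attribute selection measures defined so far can be written down almost directly from their formulas. The following is a minimal Python sketch, not part of the original tutorial; the function names and the four-record toy data are assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Info(D): expected information needed to classify a tuple in D."""
    n = len(labels)
    # Counter only holds classes that occur, so log2(0) never arises.
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def partition(values, labels):
    """Group the class labels by the value of the splitting attribute A."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())


def info_after_split(values, labels):
    """Info_A(D): size-weighted entropy of the partitions induced by A."""
    n = len(labels)
    return sum(len(g) / n * entropy(g) for g in partition(values, labels))


def gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D)."""
    return entropy(labels) - info_after_split(values, labels)


def split_info(values):
    """SplitInfo_A(D): entropy of the partition sizes themselves."""
    return entropy(values)


def gain_ratio(values, labels):
    """GainRatio(A); undefined (division by zero) if A has a single value."""
    return gain(values, labels) / split_info(values)


def gini(labels):
    """gini(D) = 1 - sum of squared class frequencies."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_after_split(values, labels):
    """gini_A(D): size-weighted Gini index of the partitions induced by A."""
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in partition(values, labels))


# Hypothetical toy data: attribute A separates the two classes perfectly.
toy_values = ["a", "a", "b", "b"]
toy_labels = ["+", "+", "-", "-"]
print(gain(toy_values, toy_labels), gini(toy_labels))
```

On the toy data the split is perfect, so the gain equals the entropy of the labels and the Gini index after the split drops to zero.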
• Reduction in impurity:
  $$\Delta\mathrm{gini}(A) = \mathrm{gini}(D) - \mathrm{gini}_A(D)$$

Example

Outlook    Temperature  Humidity  Wind    Play Tennis
Sunny      >25          High      Weak    No
Sunny      >25          High      Strong  No
Overcast   >25          High      Weak    Yes
Rain       15-25        High      Weak    Yes
Rain       <15          Normal    Weak    Yes
Rain       <15          Normal    Strong  No
Overcast   <15          Normal    Strong  Yes
Sunny      15-25        High      Weak    No
Sunny      <15          Normal    Weak    Yes
Rain       15-25        Normal    Weak    Yes
Sunny      15-25        Normal    Strong  Yes
Overcast   15-25        High      Strong  Yes
Overcast   >25          Normal    Weak    Yes
Rain       15-25        High      Strong  No

Tree induction example

In the counts below, [a+, b-] means a records with Play Tennis = Yes and b records with Play Tennis = No. A term of the form $0 \log_2 0$ is taken to be 0.

• Entropy of data S [9+, 5-]:
  $$\mathrm{Info}(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.94$$
• Split data by attribute Outlook: Sunny [2+, 3-], Overcast [4+, 0-], Rain [3+, 2-]
  $$\mathrm{Gain}(Outlook) = 0.94 - \frac{5}{14}\left[-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right] - \frac{4}{14}\left[-\frac{4}{4}\log_2\frac{4}{4} - \frac{0}{4}\log_2\frac{0}{4}\right] - \frac{5}{14}\left[-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right] = 0.94 - 0.69 = 0.25$$
• Split data by attribute Temperature: <15 [3+, 1-], 15-25 [4+, 2-], >25 [2+, 2-]
  $$\mathrm{Gain}(Temperature) = 0.94 - \frac{4}{14}\left[-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4}\right] - \frac{6}{14}\left[-\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6}\right] - \frac{4}{14}\left[-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right] = 0.94 - 0.91 = 0.03$$
• Split data by attribute Humidity: High [3+, 4-], Normal [6+, 1-]
  $$\mathrm{Gain}(Humidity) = 0.94 - \frac{7}{14}\left[-\frac{3}{7}\log_2\frac{3}{7} - \frac{4}{7}\log_2\frac{4}{7}\right] - \frac{7}{14}\left[-\frac{6}{7}\log_2\frac{6}{7} - \frac{1}{7}\log_2\frac{1}{7}\right] = 0.94 - 0.79 = 0.15$$
• Split data by attribute Wind: Weak [6+, 2-], Strong [3+, 3-]
  $$\mathrm{Gain}(Wind) = 0.94 - \frac{8}{14}\left[-\frac{6}{8}\log_2\frac{6}{8} - \frac{2}{8}\log_2\frac{2}{8}\right] - \frac{6}{14}\left[-\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6}\right] = 0.94 - 0.89 = 0.05$$
• Summary at the root: Gain(Outlook) = 0.25, Gain(Temperature) = 0.03, Gain(Humidity) = 0.15, Gain(Wind) = 0.05. Outlook has the highest gain and becomes the root. The Overcast branch is pure (all Yes); the Sunny and Rain branches still need to be split:

  Outlook
    Sunny:    ??
    Overcast: Yes
    Rain:     ??

• Entropy of branch Sunny [2+, 3-]:
  $$\mathrm{Info}(Sunny) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.97$$
• Split Sunny branch by attribute Temperature: <15 [1+, 0-], 15-25 [1+, 1-], >25 [0+, 2-]
  $$\mathrm{Gain}(Temperature) = 0.97 - \frac{1}{5}\left[-\frac{1}{1}\log_2\frac{1}{1} - \frac{0}{1}\log_2\frac{0}{1}\right] - \frac{2}{5}\left[-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right] - \frac{2}{5}\left[-\frac{0}{2}\log_2\frac{0}{2} - \frac{2}{2}\log_2\frac{2}{2}\right] = 0.97 - 0.4 = 0.57$$
• Split Sunny branch by attribute Humidity: High [0+, 3-], Normal [2+, 0-]
  $$\mathrm{Gain}(Humidity) = 0.97 - \frac{3}{5}\left[-\frac{0}{3}\log_2\frac{0}{3} - \frac{3}{3}\log_2\frac{3}{3}\right] - \frac{2}{5}\left[-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right] = 0.97 - 0 = 0.97$$
• Split Sunny branch by attribute Wind: Weak [1+, 2-], Strong [1+, 1-]
  $$\mathrm{Gain}(Wind) = 0.97 - \frac{3}{5}\left[-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right] - \frac{2}{5}\left[-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right] = 0.97 - 0.95 = 0.02$$
• Humidity has the highest gain on the Sunny branch:

  Outlook
    Sunny:    Humidity
                High:   No
                Normal: Yes
    Overcast: Yes
    Rain:     ??

• Entropy of branch Rain [3+, 2-]:
  $$\mathrm{Info}(Rain) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.97$$
• Split Rain branch by attribute Temperature: <15 [1+, 1-], 15-25 [2+, 1-], >25 [0+, 0-]. The >25 partition is empty and contributes 0.
  $$\mathrm{Gain}(Temperature) = 0.97 - \frac{2}{5}\left[-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right] - \frac{3}{5}\left[-\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3}\right] = 0.97 - 0.95 = 0.02$$
• Split Rain branch by attribute Humidity: High [1+, 1-], Normal [2+, 1-]
  $$\mathrm{Gain}(Humidity) = 0.97 - \frac{2}{5}\left[-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right] - \frac{3}{5}\left[-\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3}\right] = 0.97 - 0.95 = 0.02$$
• Split Rain branch by attribute Wind: Weak [3+, 0-], Strong [0+, 2-]
  $$\mathrm{Gain}(Wind) = 0.97 - \frac{3}{5}\left[-\frac{3}{3}\log_2\frac{3}{3} - \frac{0}{3}\log_2\frac{0}{3}\right] - \frac{2}{5}\left[-\frac{0}{2}\log_2\frac{0}{2} - \frac{2}{2}\log_2\frac{2}{2}\right] = 0.97 - 0 = 0.97$$
• Wind has the highest gain on the Rain branch, which gives the final tree:

  Outlook
    Sunny:    Humidity
                High:   No
                Normal: Yes
    Overcast: Yes
    Rain:     Wind
                Weak:   Yes
                Strong: No

Bayesian Classification

• A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities $P(C_i \mid x_1, x_2, \ldots, x_n)$, where $x_i$ is the value of attribute $A_i$.
• Choose the class label that has the highest probability.
• Foundation: based on Bayes' theorem.
  $$P(C_i \mid x_1, x_2, \ldots, x_n) = \frac{P(x_1, x_2, \ldots, x_n \mid C_i)\, P(C_i)}{P(x_1, x_2, \ldots, x_n)}$$
  • $P(C_i \mid x_1, x_2, \ldots, x_n)$: posterior probability
  • $P(C_i)$: prior probability
  • $P(x_1, x_2, \ldots, x_n \mid C_i)$: likelihood
• Model: computed from data.

Naïve Bayes Classifier

• Problem: the joint probabilities $P(x_1, x_2, \ldots, x_n \mid C_i)$ are difficult to estimate.
• Naïve Bayes classifier
  • Assumption: attributes are conditionally independent given the class.
    $$P(x_1, x_2, \ldots, x_n \mid C_i) = P(x_1 \mid C_i) \times \cdots \times P(x_n \mid C_i)$$
    $$P(C_i \mid x_1, x_2, \ldots, x_n) = \frac{\left[\prod_{j=1}^{n} P(x_j \mid C_i)\right] P(C_i)}{P(x_1, x_2, \ldots, x_n)}$$

Example: Naïve Bayes Classifier

A  B  C
m  b  t
m  s  t
g  q  t
h  s  t
g  q  t
g  q  f
g  s  f
h  b  f
h  q  f
m  b  f

• P(C=t) = 1/2, P(C=f) = 1/2
• P(A=m|C=t) = 2/5, P(A=m|C=f) = 1/5
• P(B=q|C=t) = 2/5, P(B=q|C=f) = 2/5
• Test record: A=m, B=q, C=?
• For C = t:
  P(A=m|C=t) * P(B=q|C=t) * P(C=t) = 2/5 * 2/5 * 1/2 = 2/25 (higher)
  P(C=t|A=m, B=q) = (2/25) / P(A=m, B=q)
• For C = f:
  P(A=m|C=f) * P(B=q|C=f) * P(C=f) = 1/5 * 2/5 * 1/2 = 1/25
  P(C=f|A=m, B=q) = (1/25) / P(A=m, B=q)
• Conclusion: both posteriors share the denominator P(A=m, B=q), so the test record A=m, B=q is assigned C=t.

Nearest Neighbor Classification

• Input
  • A set of stored records
  • k: number of nearest neighbors
• Output: the class label of the unknown record, determined as follows:
  • Compute the distance to every stored record:
    $$d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$$
  • Identify the k nearest neighbors.
  • Determine the class label of the unknown record based on the class labels of the nearest neighbors (i.e., by taking a majority vote).

Nearest Neighbor Classification: A Discrete Example

• Input: 8 training instances
  P1 (4, 2) → Orange
  P2 (0.5, 2.5) → Orange
  P3 (2.5, 2.5) → Orange
  P4 (3, 3.5) → Orange
  P5 (5.5, 3.5) → Orange
  P6 (2, 4) → Black
  P7 (4, 5) → Black
  P8 (2.5, 5.5) → Black
• New instance: Pn (4, 4) → ?
• k = 1 and k = 3
• Calculate the distances:
  d(P1, Pn) = $\sqrt{(4 - 4)^2 + (2 - 4)^2}$ = 2
  d(P2, Pn) = 3.81
  d(P3, Pn) = 2.12
  d(P4, Pn) = 1.12
  d(P5, Pn) = 1.58
  d(P6, Pn) = 2
  d(P7, Pn) = 1
  d(P8, Pn) = 2.12

[Figure: scatter plots of P1 to P8 and Pn, showing the neighborhoods for k = 1 and k = 3]

• With k = 1 the nearest neighbor is P7, so Pn is classified as Black. With k = 3 the nearest neighbors are P7 (Black), P4 (Orange), and P5 (Orange), so the majority vote classifies Pn as Orange.

Nearest Neighbor Classification: Scaling Issues

• Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.
  • Each attribute should fall in the same range.
  • Min-max normalization is one way to achieve this.
• Example:
  • Two data records: a = (1, 1000), b = (0.5, 1)
  • dis(a, b) = ?

Classification: Lazy & Eager Learning

Two types of learning methodologies:

• Lazy learning
  • Instance-based learning (k-NN)
• Eager learning
  • Decision-tree and Bayesian classification
  • ANN & SVM

Differences Between Lazy & Eager Learning

• Lazy learning
  a. Does not require model building.
  b. Spends less time training but more time predicting.
  c. Effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function.
• Eager learning
  a. Requires model building.
  b. Spends more time training but less time predicting.
  c. Must commit to a single hypothesis that covers the entire instance space.
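The tree induction example above can be checked mechanically. The sketch below is an illustration under stated assumptions rather than the tutorial's own code: it encodes the 14-record Play Tennis table, computes the information gain of each attribute at the root, and grows the tree with a simple ID3-style recursion. The names `ROWS`, `gain`, and `build_tree` are hypothetical.

```python
from collections import Counter
from math import log2

ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
# The 14 records of the example table; the last field is Play Tennis.
ROWS = [
    ("Sunny", ">25", "High", "Weak", "No"),
    ("Sunny", ">25", "High", "Strong", "No"),
    ("Overcast", ">25", "High", "Weak", "Yes"),
    ("Rain", "15-25", "High", "Weak", "Yes"),
    ("Rain", "<15", "Normal", "Weak", "Yes"),
    ("Rain", "<15", "Normal", "Strong", "No"),
    ("Overcast", "<15", "Normal", "Strong", "Yes"),
    ("Sunny", "15-25", "High", "Weak", "No"),
    ("Sunny", "<15", "Normal", "Weak", "Yes"),
    ("Rain", "15-25", "Normal", "Weak", "Yes"),
    ("Sunny", "15-25", "Normal", "Strong", "Yes"),
    ("Overcast", "15-25", "High", "Strong", "Yes"),
    ("Overcast", ">25", "Normal", "Weak", "Yes"),
    ("Rain", "15-25", "High", "Strong", "No"),
]


def entropy(labels):
    """Info(D) over the class labels that actually occur."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def gain(rows, col):
    """Information gain of splitting `rows` on the attribute in column `col`."""
    groups = {}
    for r in rows:
        groups.setdefault(r[col], []).append(r[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[-1] for r in rows]) - remainder


def build_tree(rows, cols):
    """Greedy top-down induction: pick the highest-gain attribute, recurse."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:  # pure node becomes a leaf
        return labels[0]
    if not cols:  # no attributes left: fall back to the majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(cols, key=lambda c: gain(rows, c))
    rest = [c for c in cols if c != best]
    branches = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, rest)
    return {ATTRS[best]: branches}


root_gains = {ATTRS[c]: round(gain(ROWS, c), 2) for c in range(len(ATTRS))}
tree = build_tree(ROWS, list(range(len(ATTRS))))
print(root_gains)
print(tree)
```

The sketch only creates branches for attribute values that occur in the current subset, so an empty partition (such as Temperature >25 under Rain) never appears. It also applies no pruning or stopping criterion beyond purity.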
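The Naïve Bayes example above (test record A=m, B=q) can be reproduced with exact fractions. This is a minimal sketch assuming plain relative-frequency estimates with no smoothing; `RECORDS` and `naive_bayes_scores` are hypothetical names introduced for illustration.

```python
from collections import Counter
from fractions import Fraction

# The ten training records (A, B, C) from the example table.
RECORDS = [
    ("m", "b", "t"), ("m", "s", "t"), ("g", "q", "t"), ("h", "s", "t"),
    ("g", "q", "t"), ("g", "q", "f"), ("g", "s", "f"), ("h", "b", "f"),
    ("h", "q", "f"), ("m", "b", "f"),
]


def naive_bayes_scores(records, query):
    """Return P(C) * prod_j P(x_j | C) for every class C.

    The shared denominator P(x_1, ..., x_n) is omitted because it does not
    change which class has the highest score.
    """
    class_counts = Counter(r[-1] for r in records)
    total = len(records)
    scores = {}
    for cls, count in class_counts.items():
        score = Fraction(count, total)  # prior P(C)
        for j, value in enumerate(query):
            matches = sum(1 for r in records if r[-1] == cls and r[j] == value)
            score *= Fraction(matches, count)  # P(x_j | C), no smoothing
        scores[cls] = score
    return scores


scores = naive_bayes_scores(RECORDS, ("m", "q"))
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Without smoothing, an attribute value never seen with a class drives that class's score to zero; that case does not arise for this test record.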
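The nearest neighbor example above (Pn at (4, 4), with k = 1 and k = 3) can be sketched in a few lines. This is an illustrative sketch, assuming Euclidean distance and a simple majority vote; `TRAIN` and `knn_predict` are hypothetical names, and `math.dist` requires Python 3.8 or later.

```python
from collections import Counter
from math import dist

# Training instances P1..P8 from the discrete example.
TRAIN = [
    ((4, 2), "Orange"), ((0.5, 2.5), "Orange"), ((2.5, 2.5), "Orange"),
    ((3, 3.5), "Orange"), ((5.5, 3.5), "Orange"), ((2, 4), "Black"),
    ((4, 5), "Black"), ((2.5, 5.5), "Black"),
]


def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest stored records."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    # On a tied vote, most_common returns the label encountered first,
    # i.e. the one belonging to the closer neighbor.
    return votes.most_common(1)[0][0]


pn = (4, 4)
print(knn_predict(TRAIN, pn, 1), knn_predict(TRAIN, pn, 3))
```

All the work happens at prediction time (sorting the stored records by distance), which is the lazy-learning behavior described above.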
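For the scaling example above, with a = (1, 1000) and b = (0.5, 1), the following sketch compares the Euclidean distance before and after min-max normalization. It assumes the minimum and maximum of each attribute are taken over just these two records, which is an assumption of this illustration; `min_max_scale` is a hypothetical helper.

```python
from math import dist


def min_max_scale(records):
    """Rescale every attribute to [0, 1] using its min and max over `records`."""
    columns = list(zip(*records))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple(
            (x - lo) / (hi - lo) if hi > lo else 0.0  # constant attribute maps to 0
            for x, lo, hi in zip(r, lows, highs)
        )
        for r in records
    ]


a, b = (1, 1000), (0.5, 1)
raw_distance = dist(a, b)  # dominated by the second attribute
scaled_a, scaled_b = min_max_scale([a, b])
scaled_distance = dist(scaled_a, scaled_b)
print(raw_distance, scaled_distance)
```

Before scaling, the distance is about 999 and is determined almost entirely by the second attribute; after scaling, both attributes contribute equally.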
