
SEEM4630 2012-2013
Tutorial 2 – Classification: Decision tree, Naïve Bayes & k-NN
WANG Jing

Classification: Definition
- Given a collection of records (training set)
  - Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
  - Decision tree, Naïve Bayes & k-NN
- Goal: previously unseen records should be assigned a class as accurately as possible.

Decision Tree
- Goal
  - Construct a tree so that instances belonging to different classes are separated.
- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down recursive manner.
  - At the start, all the training examples are at the root.
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
  - Examples are partitioned recursively based on the selected attributes.

Attribute Selection Measure 1: Information Gain
- Let $p_i$ be the probability that a tuple belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.
- Expected information (entropy) needed to classify a tuple in $D$:
  $$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
- Information needed (after using $A$ to split $D$ into $v$ partitions) to classify $D$:
  $$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$
- Information gained by branching on attribute $A$:
  $$Gain(A) = Info(D) - Info_A(D)$$

Attribute Selection Measure 2: Gain Ratio
- The information gain measure is biased towards attributes with a large number of values.
- C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):
  $$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$$
- $GainRatio(A) = Gain(A) / SplitInfo_A(D)$

Attribute Selection Measure 3: Gini index
- If a data set $D$ contains examples from $n$ classes, the gini index $gini(D)$ is defined as
  $$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$
  where $p_j$ is the relative frequency of class $j$ in $D$.
- If a data set $D$ is split on $A$ into two subsets $D_1$ and $D_2$, the gini index $gini_A(D)$ is defined as
  $$gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)$$
- Reduction in impurity:
  $$\Delta gini(A) = gini(D) - gini_A(D)$$

Example

Outlook    Temperature  Humidity  Wind    Play Tennis
Sunny      >25          High      Weak    No
Sunny      >25          High      Strong  No
Overcast   >25          High      Weak    Yes
Rain       15-25        High      Weak    Yes
Rain       <15          Normal    Weak    Yes
Rain       <15          Normal    Strong  No
Overcast   <15          Normal    Strong  Yes
Sunny      15-25        High      Weak    No
Sunny      <15          Normal    Weak    Yes
Rain       15-25        Normal    Weak    Yes
Sunny      15-25        Normal    Strong  Yes
Overcast   15-25        High      Strong  Yes
Overcast   >25          Normal    Weak    Yes
Rain       15-25        High      Strong  No

Tree induction example

S[9+, 5-]
Info(S) = -9/14(log2(9/14)) - 5/14(log2(5/14)) = 0.94

In the terms below, 0·log2(0) is taken to be 0.

Outlook: Sunny [2+,3-], Overcast [4+,0-], Rain [3+,2-]
Gain(Outlook) = 0.94 - 5/14[-2/5(log2(2/5)) - 3/5(log2(3/5))]
                     - 4/14[-4/4(log2(4/4)) - 0/4(log2(0/4))]
                     - 5/14[-3/5(log2(3/5)) - 2/5(log2(2/5))]
              = 0.94 - 0.69 = 0.25

Temperature: <15 [3+,1-], 15-25 [4+,2-], >25 [2+,2-]
Gain(Temperature) = 0.94 - 4/14[-3/4(log2(3/4)) - 1/4(log2(1/4))]
                         - 6/14[-4/6(log2(4/6)) - 2/6(log2(2/6))]
                         - 4/14[-2/4(log2(2/4)) - 2/4(log2(2/4))]
                  = 0.94 - 0.91 = 0.03

Humidity: High [3+,4-], Normal [6+,1-]
Gain(Humidity) = 0.94 - 7/14[-3/7(log2(3/7)) - 4/7(log2(4/7))]
                      - 7/14[-6/7(log2(6/7)) - 1/7(log2(1/7))]
               = 0.94 - 0.79 = 0.15

Wind: Weak [6+,2-], Strong [3+,3-]
Gain(Wind) = 0.94 - 8/14[-6/8(log2(6/8)) - 2/8(log2(2/8))]
                  - 6/14[-3/6(log2(3/6)) - 3/6(log2(3/6))]
           = 0.94 - 0.89 = 0.05

Gain(Outlook) = 0.25
Gain(Temperature) = 0.03
Gain(Humidity) = 0.15
Gain(Wind) = 0.05

Outlook has the largest gain, so it is chosen as the root:

Outlook
  Sunny    -> ??
  Overcast -> Yes
  Rain     -> ??
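The gain calculations for the root can be checked mechanically. The following is a minimal sketch that recomputes Info(S) and the four gains directly from the 14-record training table; the names `ROWS`, `ATTRS`, `info` and `gain` are introduced here for illustration and are not part of the tutorial.

```python
from collections import Counter
from math import log2

# Training table from the Example slide:
# (Outlook, Temperature, Humidity, Wind, Play Tennis)
ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
ROWS = [
    ("Sunny", ">25", "High", "Weak", "No"),
    ("Sunny", ">25", "High", "Strong", "No"),
    ("Overcast", ">25", "High", "Weak", "Yes"),
    ("Rain", "15-25", "High", "Weak", "Yes"),
    ("Rain", "<15", "Normal", "Weak", "Yes"),
    ("Rain", "<15", "Normal", "Strong", "No"),
    ("Overcast", "<15", "Normal", "Strong", "Yes"),
    ("Sunny", "15-25", "High", "Weak", "No"),
    ("Sunny", "<15", "Normal", "Weak", "Yes"),
    ("Rain", "15-25", "Normal", "Weak", "Yes"),
    ("Sunny", "15-25", "Normal", "Strong", "Yes"),
    ("Overcast", "15-25", "High", "Strong", "Yes"),
    ("Overcast", ">25", "Normal", "Weak", "Yes"),
    ("Rain", "15-25", "High", "Strong", "No"),
]


def info(rows):
    """Entropy Info(D), in bits, of the class label (last field of each row)."""
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    # Only classes that actually occur are counted, so 0*log2(0) never arises.
    return -sum(c / total * log2(c / total) for c in counts.values())


def gain(rows, attr):
    """Gain(A) = Info(D) - Info_A(D) for the attribute named `attr`."""
    idx = ATTRS.index(attr)
    partitions = {}
    for r in rows:
        partitions.setdefault(r[idx], []).append(r)
    info_a = sum(len(p) / len(rows) * info(p) for p in partitions.values())
    return info(rows) - info_a


gains = {a: gain(ROWS, a) for a in ATTRS}
for a, g in gains.items():
    print(f"Gain({a}) = {g:.3f}")
best = max(gains, key=gains.get)
print("Root attribute:", best)
```

Because `gain` takes the rows as an argument, it can also be called on a subset (for example, only the Sunny rows) to evaluate the splits deeper in the tree.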
Sunny branch

Sunny[2+,3-]
Info(Sunny) = -2/5(log2(2/5)) - 3/5(log2(3/5)) = 0.97

Temperature: <15 [1+,0-], 15-25 [1+,1-], >25 [0+,2-]
Gain(Temperature) = 0.97 - 1/5[-1/1(log2(1/1)) - 0/1(log2(0/1))]
                         - 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))]
                         - 2/5[-0/2(log2(0/2)) - 2/2(log2(2/2))]
                  = 0.97 - 0.4 = 0.57

Humidity: High [0+,3-], Normal [2+,0-]
Gain(Humidity) = 0.97 - 3/5[-0/3(log2(0/3)) - 3/3(log2(3/3))]
                      - 2/5[-2/2(log2(2/2)) - 0/2(log2(0/2))]
               = 0.97 - 0 = 0.97

Wind: Weak [1+,2-], Strong [1+,1-]
Gain(Wind) = 0.97 - 3/5[-1/3(log2(1/3)) - 2/3(log2(2/3))]
                  - 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))]
           = 0.97 - 0.95 = 0.02

Humidity has the largest gain on the Sunny branch:

Outlook
  Sunny    -> Humidity
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> ??

Rain branch

Rain[3+,2-]
Info(Rain) = -3/5(log2(3/5)) - 2/5(log2(2/5)) = 0.97

Temperature: <15 [1+,1-], 15-25 [2+,1-], >25 [0+,0-]
Gain(Temperature) = 0.97 - 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))]
                         - 3/5[-2/3(log2(2/3)) - 1/3(log2(1/3))]
                  = 0.97 - 0.95 = 0.02
(The empty >25 partition has weight 0/5 and contributes nothing.)

Humidity: High [1+,1-], Normal [2+,1-]
Gain(Humidity) = 0.97 - 2/5[-1/2(log2(1/2)) - 1/2(log2(1/2))]
                      - 3/5[-2/3(log2(2/3)) - 1/3(log2(1/3))]
               = 0.97 - 0.95 = 0.02

Wind: Weak [3+,0-], Strong [0+,2-]
Gain(Wind) = 0.97 - 3/5[-3/3(log2(3/3)) - 0/3(log2(0/3))]
                  - 2/5[-0/2(log2(0/2)) - 2/2(log2(2/2))]
           = 0.97 - 0 = 0.97

Wind has the largest gain on the Rain branch, which completes the tree:

Outlook
  Sunny    -> Humidity
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind
                Weak   -> Yes
                Strong -> No

Bayesian Classification
- A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
  $$P(C_i \mid x_1, x_2, \ldots, x_n)$$
  where $x_i$ is the value of attribute $A_i$.
- Choose the class label that has the highest probability.
- Foundation: based on Bayes' Theorem.
  $$P(C_i \mid x_1, x_2, \ldots, x_n) = \frac{P(x_1, x_2, \ldots, x_n \mid C_i) \, P(C_i)}{P(x_1, x_2, \ldots, x_n)}$$
  - $P(C_i \mid x_1, x_2, \ldots, x_n)$: posterior probability
  - $P(C_i)$: prior probability
  - $P(x_1, x_2, \ldots, x_n \mid C_i)$: likelihood
- Model: computed from data. How can the likelihood be estimated?
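To make the roles of the three terms in Bayes' Theorem concrete, here is a minimal sketch that turns per-class likelihoods and priors into posteriors and picks the most probable class. The function name `posteriors` and the numeric values are hypothetical, chosen only for illustration; the evidence term is obtained by summing likelihood times prior over all classes.

```python
from fractions import Fraction


def posteriors(likelihood, prior):
    """Bayes' Theorem per class: P(C|x) = P(x|C) * P(C) / P(x).

    P(x) is computed as the sum of P(x|C) * P(C) over all classes.
    """
    joint = {c: likelihood[c] * prior[c] for c in prior}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


# Hypothetical values for a two-class problem (illustration only).
prior = {"t": Fraction(1, 2), "f": Fraction(1, 2)}
likelihood = {"t": Fraction(4, 25), "f": Fraction(2, 25)}

post = posteriors(likelihood, prior)
predicted = max(post, key=post.get)
print(post, predicted)
```

Since the evidence is the same for every class, comparing likelihood times prior is enough to choose the class; dividing by the evidence only rescales the scores so that they sum to 1.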
Naïve Bayes Classifier
- Problem: joint probabilities such as $P(x_1, x_2, \ldots, x_n \mid C_i)$ are difficult to estimate.
- Naïve Bayes Classifier
  - Assumption: attributes are conditionally independent given the class.
    $$P(x_1, x_2, \ldots, x_n \mid C_i) = P(x_1 \mid C_i) \times \cdots \times P(x_n \mid C_i)$$
    $$P(C_i \mid x_1, x_2, \ldots, x_n) = \frac{\prod_{j=1}^{n} P(x_j \mid C_i) \, P(C_i)}{P(x_1, x_2, \ldots, x_n)}$$

Naïve Bayes Classifier: Example

A  B  C
m  b  t
m  s  t
g  q  t
h  s  t
g  q  t
g  q  f
g  s  f
h  b  f
h  q  f
m  b  f

P(C=t) = 1/2          P(C=f) = 1/2
P(A=m|C=t) = 2/5      P(A=m|C=f) = 1/5
P(B=q|C=t) = 2/5      P(B=q|C=f) = 2/5

Test record: A=m, B=q, C=?

- For C=t
  P(A=m|C=t) * P(B=q|C=t) * P(C=t) = 2/5 * 2/5 * 1/2 = 2/25 (higher)
  P(C=t|A=m, B=q) = (2/25) / P(A=m, B=q)
- For C=f
  P(A=m|C=f) * P(B=q|C=f) * P(C=f) = 1/5 * 2/5 * 1/2 = 1/25
  P(C=f|A=m, B=q) = (1/25) / P(A=m, B=q)
- Conclusion: A=m, B=q, C=t

(SEG4630 Tutorial 6, made by Wenting)

Nearest Neighbor Classification
- Input
  - A set of stored records
  - k: number of nearest neighbors
- Output
  - Compute the distance: $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
  - Identify the k nearest neighbors.
  - Determine the class label of the unknown record based on the class labels of the nearest neighbors (i.e., by taking a majority vote).

Nearest Neighbor Classification: A Discrete Example
- Input: 8 training instances
  P1 (4, 2)      Orange
  P2 (0.5, 2.5)  Orange
  P3 (2.5, 2.5)  Orange
  P4 (3, 3.5)    Orange
  P5 (5.5, 3.5)  Orange
  P6 (2, 4)      Black
  P7 (4, 5)      Black
  P8 (2.5, 5.5)  Black
- New instance: Pn (4, 4), class ???, for k = 1 and k = 3
- Calculate the distances:
  d(P1, Pn) = sqrt((4 - 4)^2 + (2 - 4)^2) = 2
  d(P2, Pn) = 3.81
  d(P3, Pn) = 2.12
  d(P4, Pn) = 1.12
  d(P5, Pn) = 1.58
  d(P6, Pn) = 2
  d(P7, Pn) = 1
  d(P8, Pn) = 2.12
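The three steps (compute distances, pick the k nearest, take a majority vote) can be sketched in a few lines for the discrete example above. The names `TRAIN` and `knn_predict` are introduced here for illustration; ties in the vote are not handled specially and fall to whichever label was encountered first.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8

# The 8 training instances P1..P8 from the discrete example.
TRAIN = [
    ((4, 2), "Orange"),
    ((0.5, 2.5), "Orange"),
    ((2.5, 2.5), "Orange"),
    ((3, 3.5), "Orange"),
    ((5.5, 3.5), "Orange"),
    ((2, 4), "Black"),
    ((4, 5), "Black"),
    ((2.5, 5.5), "Black"),
]


def knn_predict(train, query, k):
    """Return the majority class among the k training points nearest to query."""
    neighbours = sorted(train, key=lambda rec: dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


pn = (4, 4)
for point, label in TRAIN:
    print(point, label, round(dist(point, pn), 2))
for k in (1, 3):
    print(f"k={k}:", knn_predict(TRAIN, pn, k))
```

For Pn = (4, 4), the single nearest neighbor is P7, so k = 1 gives Black; the three nearest are P7, P4 and P5, so k = 3 gives Orange by two votes to one.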
Nearest Neighbor Classification
[Figure: two scatter plots of P1 to P8 and the new instance Pn, showing the neighborhood used for k = 3 and for k = 1]

Nearest Neighbor Classification…
- Scaling issues
  - Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.
    - Each attribute must fall in the same range.
    - Min-Max normalization
  - Example:
    - Two data records: a = (1, 1000), b = (0.5, 1)
    - dis(a, b) = ?

Lazy & Eager Learning
- Two types of learning methodologies
  - Lazy learning
    - Instance-based learning (k-NN)
  - Eager learning
    - Decision-tree and Bayesian classification
    - ANN & SVM
[Figure: the k-NN scatter plots of P1 to P8 and Pn, repeated]

Lazy & Eager Learning
- Key differences
  - Lazy learning
    - Does not require model building
    - Less time training but more time predicting
    - Effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function
  - Eager learning
    - Requires model building
    - More time training but less time predicting
    - Must commit to a single hypothesis that covers the entire instance space
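Returning to the scaling example with a = (1, 1000) and b = (0.5, 1), the sketch below compares the raw Euclidean distance with the distance after Min-Max normalization. The function `min_max_scale` is a hypothetical helper, and taking the minimum and maximum from just these two records is an assumption made for illustration; in practice the range would come from the whole training set.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def min_max_scale(records):
    """Rescale every attribute to [0, 1] using the min and max seen in records."""
    columns = list(zip(*records))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
            for v, lo, hi in zip(rec, lows, highs)
        )
        for rec in records
    ]


a, b = (1, 1000), (0.5, 1)

raw = dist(a, b)              # dominated by the second attribute
sa, sb = min_max_scale([a, b])
scaled = dist(sa, sb)         # both attributes now contribute equally

print(f"raw distance    = {raw:.4f}")
print(f"scaled records  = {sa}, {sb}")
print(f"scaled distance = {scaled:.4f}")
```

Before scaling, the difference of 999 in the second attribute accounts for practically the whole distance and the difference of 0.5 in the first is invisible; after scaling, each attribute spans the same range.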
