SVM and Decision Trees
Le Song
Machine Learning I, CSE 6740, Fall 2013

Which decision boundary is better?
- Suppose the training samples are linearly separable.
- We can find a decision boundary which gives zero training error.
- But there are many such decision boundaries. Which one is better?
(Figure: two linearly separable point clouds, Class 1 and Class 2, with several zero-error boundaries.)

Compare two decision boundaries
- Suppose we perturb the data: which boundary is more susceptible to error?

Constraints on data points
- For all x in class 2, y = 1 and w⊤x + b ≥ c.
- For all x in class 1, y = −1 and w⊤x + b ≤ −c.
- Or, more compactly, (w⊤x + b) y ≥ c.
(Figure: separating hyperplane w⊤x + b = 0, with the two classes at distance at least c on either side.)

Classifier margin
- Pick two data points x¹ and x² that lie on each dashed line, respectively.
- The margin is γ = (1/‖w‖) w⊤(x¹ − x²) = 2c / ‖w‖.

Maximum margin classifier
- Find the decision boundary w as far from the data points as possible:
  max_{w,b}  2c / ‖w‖
  s.t.  y^i (w⊤x^i + b) ≥ c,  ∀i

Support vector machines with hard margin
- Fix c = 1; maximizing the margin 2c/‖w‖ is then equivalent to:
  min_{w,b}  ‖w‖
  s.t.  y^i (w⊤x^i + b) ≥ 1,  ∀i
- Convert to standard form:
  min_{w,b}  (1/2) w⊤w
  s.t.  1 − y^i (w⊤x^i + b) ≤ 0,  ∀i
- The Lagrangian function:
  L(w, b, α) = (1/2) w⊤w + Σ_i α_i (1 − y^i (w⊤x^i + b))

Deriving the dual problem
- L(w, b, α) = (1/2) w⊤w + Σ_{i=1}^m α_i (1 − y^i (w⊤x^i + b))
- Taking derivatives and setting them to zero:
  ∂L/∂w = w − Σ_i α_i y^i x^i = 0
  ∂L/∂b = Σ_i α_i y^i = 0

Plug back the relation for w and b
- L = (1/2) (Σ_i α_i y^i x^i)⊤ (Σ_j α_j y^j x^j) + Σ_i α_i (1 − y^i ((Σ_j α_j y^j x^j)⊤ x^i + b))
- After simplification:
  L = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y^i y^j (x^i)⊤ x^j

The dual problem
- max_α  Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y^i y^j (x^i)⊤ x^j
  s.t.  α_i ≥ 0,  i = 1, …, m
        Σ_i α_i y^i = 0
- This is a constrained quadratic program: nice and convex, and the global maximum can be found.
- w can be recovered as w = Σ_i α_i y^i x^i. How about b?

Support vectors
- Note the KKT condition: α_i (1 − y^i (w⊤x^i + b)) = 0.
- For data points with 1 − y^i (w⊤x^i + b) < 0, α_i = 0.
- For data points with 1 − y^i (w⊤x^i + b) = 0, α_i > 0.
- Call the training data points whose α_i's are nonzero the support vectors (SV).
(Figure: e.g., α_1 = 0.8, α_6 = 1.4, α_8 = 0.6 are nonzero; all other α_i = 0.)

Computing b and obtaining the classifier
- Pick any data point with α_i > 0 and solve for b from 1 − y^i (w⊤x^i + b) = 0.
- For a new test point z, compute
  w⊤z + b = Σ_{i ∈ SV} α_i y^i (x^i)⊤ z + b.
- Classify z as the y = +1 class (class 2 above) if the result is positive, and the y = −1 class otherwise.

Interpretation of support vector machines
- The optimal w is a linear combination of a small number of data points. This "sparse" representation can be viewed as data compression.
- To compute the weights α_i, and to use the support vector machine, we only need to specify the inner products (or kernel) between the examples, (x^i)⊤ x^j.
- We make decisions by comparing each new example z with only the support vectors:
  y* = sign( Σ_{i ∈ SV} α_i y^i (x^i)⊤ z + b )
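To make this decision rule concrete, here is a minimal NumPy sketch, not part of the original slides: the names svm_predict and recover_b, and the arrays X_sv, y_sv, alpha_sv (support vectors, their labels in {+1, −1}, and their dual weights from any QP solver for the dual above) are hypothetical placeholders.

    import numpy as np

    def svm_predict(z, X_sv, y_sv, alpha_sv, b):
        """Hard-margin SVM rule: sign( sum_i alpha_i y_i <x_i, z> + b ).

        X_sv: (n_sv, d) support vectors; y_sv: (n_sv,) labels in {+1, -1};
        alpha_sv: (n_sv,) dual weights; b: intercept."""
        score = np.sum(alpha_sv * y_sv * (X_sv @ z)) + b
        return 1 if score > 0 else -1

    def recover_b(X_sv, y_sv, alpha_sv):
        """Solve 1 - y_k (w^T x_k + b) = 0 for any support vector k (here k = 0),
        i.e. b = y_k - w^T x_k with w = sum_i alpha_i y_i x_i."""
        w = (alpha_sv * y_sv) @ X_sv
        return y_sv[0] - w @ X_sv[0]

Note that only inner products X_sv @ z between support vectors and the query appear in the prediction, which is what makes the kernel substitution mentioned above possible.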
Soft margin constraints
- What if the data is not linearly separable?
- We allow points to violate the hard margin constraint:
  (w⊤x + b) y ≥ 1 − ξ
(Figure: hyperplane w⊤x + b = 0 with slack variables ξ₁, ξ₂, ξ₃ for the violating points.)

Soft margin SVM
- min_{w,b,ξ}  (1/2) ‖w‖² + C Σ_{i=1}^m ξ^i
  s.t.  y^i (w⊤x^i + b) ≥ 1 − ξ^i,  ξ^i ≥ 0,  ∀i
- Convert to standard form:
  min_{w,b,ξ}  (1/2) w⊤w + C Σ_i ξ^i
  s.t.  1 − y^i (w⊤x^i + b) − ξ^i ≤ 0,  −ξ^i ≤ 0,  ∀i
- The Lagrangian function:
  L(w, b, ξ, α, β) = (1/2) w⊤w + C Σ_i ξ^i + Σ_i α_i (1 − y^i (w⊤x^i + b) − ξ^i) − Σ_i β_i ξ^i

Deriving the dual problem
- Taking derivatives of the Lagrangian and setting them to zero:
  ∂L/∂w = w − Σ_i α_i y^i x^i = 0
  ∂L/∂b = Σ_i α_i y^i = 0
  ∂L/∂ξ^i = C − α_i − β_i = 0

Plug back the relation for w, b and ξ
- Substituting these relations back into the Lagrangian and simplifying gives the same form as before:
  L = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y^i y^j (x^i)⊤ x^j

The dual problem
- max_α  Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y^i y^j (x^i)⊤ x^j
  s.t.  C − α_i − β_i = 0,  α_i ≥ 0,  β_i ≥ 0,  i = 1, …, m
        Σ_i α_i y^i = 0
- The constraints C − α_i − β_i = 0, α_i ≥ 0, β_i ≥ 0 can be simplified to 0 ≤ α_i ≤ C.
- This is a constrained quadratic program: nice and convex, and the global maximum can be found.

Learning nonlinear decision boundaries
- Linearly separable vs. nonlinearly separable problems.
- Examples of nonlinearly separable problems: the XOR gate, speech recognition.

A decision tree for Tax Fraud
- Input: a vector of attributes X = [Refund, MarSt, TaxInc]
- Output: Y = Cheating or Not
- H as a procedure:
  - Each internal node tests one attribute X_i.
  - Each branch from a node selects one value for X_i.
  - Each leaf node predicts Y.
- The tree: Refund = Yes → NO; Refund = No → test MarSt: Married → NO; Single or Divorced → test TaxInc: < 80K → NO; > 80K → YES.

Apply the model to test data (I–V)
- Query: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
- Start from the root of the tree: Refund = No, so follow the No branch to MarSt.
- MarSt = Married, so follow the Married branch and reach a leaf.
- Assign Cheat = "No".
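As a compact illustration of using the tree above as a procedure, here is a minimal Python sketch; the function name classify_tax_fraud and the dict-based record format are hypothetical, while the tests mirror the tree on the slide.

    def classify_tax_fraud(record):
        """Walk the Refund -> MarSt -> TaxInc tree and return the Cheat label."""
        if record["Refund"] == "Yes":
            return "No"                         # Refund = Yes -> NO
        if record["MarSt"] == "Married":
            return "No"                         # Refund = No, Married -> NO
        # Refund = No, Single or Divorced: test taxable income (in thousands)
        return "Yes" if record["TaxInc"] > 80 else "No"

    # The query from the slides: Refund = No, MarSt = Married, TaxInc = 80K
    print(classify_tax_fraud({"Refund": "No", "MarSt": "Married", "TaxInc": 80}))  # -> "No"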
Expressiveness of decision trees
- Decision trees can express any function of the input attributes.
- E.g., for Boolean functions: truth table row → path to leaf.
- Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example.
- Prefer to find more compact decision trees.

Hypothesis space (model space)
- How many distinct decision trees with n Boolean attributes?
  = number of Boolean functions
  = number of distinct truth tables with 2^n rows = 2^(2^n)
- E.g., with 6 Boolean attributes there are 18,446,744,073,709,551,616 trees.
- How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?
  Each attribute can be in (positive), in (negative), or out ⇒ 3^n distinct conjunctive hypotheses.
- A more expressive hypothesis space:
  - increases the chance that the target function can be expressed,
  - increases the number of hypotheses consistent with the training set ⇒ may get worse predictions.

Decision tree learning
(Figure: a training set with columns Tid, Attrib1, Attrib2, Attrib3, Class is fed to a tree-induction algorithm, which learns a model, the decision tree; the model is then applied to a test set whose class labels are unknown — deduction.)

Example of a decision tree
Training data (Tid, Refund, Marital Status, Taxable Income, Cheat):
  1   Yes  Single    125K  No
  2   No   Married   100K  No
  3   No   Single     70K  No
  4   Yes  Married   120K  No
  5   No   Divorced   95K  Yes
  6   No   Married    60K  No
  7   Yes  Divorced  220K  No
  8   No   Single     85K  Yes
  9   No   Married    75K  No
  10  No   Single     90K  Yes
Model (splitting attributes): Refund = Yes → NO; Refund = No → MarSt: Married → NO; Single or Divorced → TaxInc: < 80K → NO; > 80K → YES.

Another example of a decision tree
- MarSt = Married → NO; MarSt = Single or Divorced → Refund: Yes → NO; No → TaxInc: < 80K → NO; > 80K → YES.
- There can be more than one tree that fits the same data!

Top-down induction of a decision tree
Main loop:
- A ← the "best" decision attribute for the next node.
- Assign A as the decision attribute for the node.
- For each value of A, create a new descendant of the node.
- Sort the training examples to the leaf nodes.
- If the training examples are perfectly classified, STOP; else iterate over the new leaf nodes.

Tree induction
- Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
- Issues:
  - Determine how to split the records:
    - How to specify the attribute test condition?
    - How to determine the best split?
  - Determine when to stop splitting.

Splitting based on nominal attributes
- Multi-way split: use as many partitions as distinct values (e.g., CarType: Family, Sports, Luxury).
- Binary split: divide the values into two subsets; need to find the optimal partitioning (e.g., {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}).

Splitting based on ordinal attributes
- Multi-way split: use as many partitions as distinct values (e.g., Size: Small, Medium, Large).
- Binary split: divide the values into two subsets; need to find the optimal partitioning (e.g., {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}).

Splitting based on continuous attributes
Different ways of handling (see the threshold-scan sketch below):
- Discretization to form an ordinal categorical attribute:
  - Static: discretize once at the beginning.
  - Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
- Binary decision: (A < t) or (A ≥ t):
  - consider all possible splits and find the best cut,
  - can be more compute-intensive.
(Figure: (i) binary split "Taxable Income > 80K? Yes / No"; (ii) multi-way split "Taxable Income? < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K".)
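For the binary decision (A < t) vs. (A ≥ t), "consider all possible splits and find the best cut" can be sketched as below. This is a minimal illustration, not the slides' algorithm verbatim: the function names are hypothetical, and it scores candidate cuts by the weighted class entropy of the two children, the impurity measure introduced on the following slides.

    import numpy as np

    def entropy(labels):
        """Entropy of a label array: -sum_c p_c log2 p_c (0 if empty or pure)."""
        if len(labels) == 0:
            return 0.0
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def best_binary_split(values, labels):
        """Scan candidate thresholds t (midpoints between consecutive distinct sorted
        values) and return the t minimizing the weighted entropy of {A < t} vs. {A >= t}."""
        order = np.argsort(values)
        values, labels = np.asarray(values)[order], np.asarray(labels)[order]
        n = len(values)
        best_t, best_impurity = None, float("inf")
        for i in range(1, n):
            if values[i] == values[i - 1]:
                continue                      # no threshold between equal values
            t = (values[i] + values[i - 1]) / 2.0
            left, right = labels[:i], labels[i:]
            impurity = (i / n) * entropy(left) + ((n - i) / n) * entropy(right)
            if impurity < best_impurity:
                best_t, best_impurity = t, impurity
        return best_t, best_impurity

    # Taxable income (in K) and Cheat labels from the example training data above
    income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
    cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
    print(best_binary_split(income, cheat))   # -> (97.5, 0.6) on this data

On the raw income column the cheapest single cut is 97.5K; the 80K threshold in the example tree is chosen only after the MarSt split, where it separates the remaining records perfectly.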
How to determine the best split
- Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
- Homogeneous subsets have a low degree of impurity; non-homogeneous subsets have a high degree of impurity.
- Greedy approach: nodes with a homogeneous class distribution are preferred.
- Need a measure of node impurity.

How to compare attributes? Entropy
- Entropy H(X) of a random variable X: the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code).
- Information theory: the most efficient code assigns −log2 P(X = i) bits to encode the message X = i.
- So the expected number of bits to code one random X is:
  H(X) = −Σ_i P(X = i) log2 P(X = i)

Sample entropy
- S is a sample of training examples; p+ is the proportion of positive examples in S and p− is the proportion of negative examples in S.
- Entropy measures the impurity of S:
  H(S) = −p+ log2 p+ − p− log2 p−

Examples for computing entropy
- C1: 0, C2: 6.  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1.  Entropy = −0 log 0 − 1 log 1 = −0 − 0 = 0.
- C1: 1, C2: 5.  P(C1) = 1/6, P(C2) = 5/6.  Entropy = −(1/6) log2(1/6) − (5/6) log2(5/6) = 0.65.
- C1: 2, C2: 4.  P(C1) = 2/6, P(C2) = 4/6.  Entropy = −(2/6) log2(2/6) − (4/6) log2(4/6) = 0.92.

How to compare attributes? Conditional entropy and information gain
- Conditional entropy of variable X given variable Y:
  - For a specific value Y = v, the entropy of X is H(X | Y = v) = −Σ_i P(X = i | Y = v) log2 P(X = i | Y = v).
  - The conditional entropy H(X | Y) of X is the average of H(X | Y = v): H(X | Y) = Σ_v P(Y = v) H(X | Y = v).
- Mutual information (aka information gain) of X given Y: I(X; Y) = H(X) − H(X | Y).

Information gain
- Information gain (after splitting a node):
  GAIN_split = Entropy(p) − Σ_{i=1}^{k} (n_i / n) Entropy(i)
  where the parent node p with n samples is split into k partitions, and n_i is the number of records in partition i.
- It measures the reduction in entropy achieved by the split. Choose the split that achieves the largest reduction (maximizes GAIN).

Problem of splitting using information gain
- Disadvantage: it tends to prefer splits that result in a large number of partitions, each being small but pure.
- Gain ratio:
  GainRATIO_split = GAIN_split / SplitINFO,  with  SplitINFO = −Σ_{i=1}^{k} (n_i / n) log(n_i / n)
- This adjusts the information gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitionings (a large number of small partitions) are penalized!
- Used in C4.5; designed to overcome the disadvantage of information gain.
- A worked sketch of these formulas appears at the end of these notes.

Stopping criteria for tree induction
- Stop expanding a node when all the records belong to the same class.
- Stop expanding a node when all the records have similar attribute values.
- Early termination (to be discussed later).

Decision-tree-based classification
Advantages:
- Inexpensive to construct.
- Extremely fast at classifying unknown records.
- Easy to interpret for small trees.
- Accuracy is comparable to other classification techniques on many simple data sets.

Example: C4.5
- Simple depth-first construction.
- Uses information gain.
- Sorts continuous attributes at each node.
- Needs the entire data set to fit in memory, so it is unsuitable for large data sets; those require out-of-core sorting.
- You can download the software from: http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
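To make the entropy, GAIN_split, and GainRATIO_split formulas above concrete, here is a minimal sketch that evaluates a categorical split; the function names are hypothetical, base-2 logarithms are assumed throughout (matching the entropy slides), and the example scores the Refund attribute against the Cheat labels from the training data above.

    import math
    from collections import Counter

    def entropy(labels):
        """H(S) = -sum_c p_c log2 p_c over the class proportions in `labels`."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain_and_gain_ratio(attr_values, labels):
        """GAIN_split = H(parent) - sum_i (n_i/n) H(partition i);
        GainRATIO_split = GAIN_split / SplitINFO, SplitINFO = -sum_i (n_i/n) log2(n_i/n)."""
        n = len(labels)
        parent = entropy(labels)
        partitions = {}
        for v, y in zip(attr_values, labels):
            partitions.setdefault(v, []).append(y)
        gain = parent - sum((len(p) / n) * entropy(p) for p in partitions.values())
        split_info = -sum((len(p) / n) * math.log2(len(p) / n) for p in partitions.values())
        return gain, gain / split_info

    # Refund attribute and Cheat labels from the example training data (Tid 1-10)
    refund = ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"]
    cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
    print(gain_and_gain_ratio(refund, cheat))   # roughly (0.19, 0.22) for this split

Because the Refund split has only two partitions, its SplitINFO is small and the gain ratio stays close to the raw gain; a many-valued attribute such as Tid would score a high gain but be heavily penalized by SplitINFO, which is exactly the behavior the gain ratio is designed to give.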