Decision Trees and more!

Learning OR with few attributes
• Target function: OR of k literals
• Goal: learn in time
  – polynomial in k and log n
  – ε and δ constant
• ELIM makes "slow" progress
  – might disqualify only one literal per round
  – might remain with O(n) candidate literals

ELIM: Algorithm for learning OR
• Keep a list of all candidate literals.
• For every example whose classification is 0:
  – Erase all the literals that are 1.
• Correctness:
  – Our hypothesis h: an OR of our set of literals.
  – Our set of literals includes the target OR's literals.
  – Every time h predicts zero: we are correct.
• Sample size:
  – m > (1/ε) ln(3^n/δ) = O(n/ε + (1/ε) ln(1/δ))

Set Cover - Definition
• Input: S1, …, St with Si ⊆ U
• Output: Si1, …, Sik such that Si1 ∪ … ∪ Sik = U
• Question: Are there k sets that cover U?
• NP-complete

Set Cover: Greedy algorithm
• j = 0; Uj = U; C = ∅
• While Uj ≠ ∅:
  – Let Si be arg maxi |Si ∩ Uj|
  – Add Si to C
  – Let Uj+1 = Uj − Si
  – j = j + 1

Set Cover: Greedy Analysis
• At termination, C is a cover.
• Assume there is a cover C* of size k.
• C* is a cover for every Uj.
• Some S in C* covers at least |Uj|/k elements of Uj.
• Analysis of Uj: |Uj+1| ≤ |Uj| − |Uj|/k
• Solving the recursion:
  – Number of sets: j ≤ k ln(|U| + 1)

Building an Occam algorithm
• Given a sample T of size m:
  – Run ELIM on T.
  – Let LIT be the set of remaining literals.
  – Assume there exist k literals in LIT that classify all of the sample T correctly.
• Negative examples T−:
  – any subset of LIT classifies T− correctly.

Building an Occam algorithm
• Positive examples T+:
  – Search for a small subset of LIT which classifies T+ correctly.
  – For a literal z build Sz = {x | z satisfies x}.
  – Our assumption: there are k sets that cover T+.
  – Greedy finds at most k ln m sets that cover T+.
• Output h = OR of the at most k ln m literals.
• size(h) < k ln m · log(2n)
• Sample size: m = O(k log n · log(k log n)) (for constant ε and δ)
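As a concrete illustration, here is a minimal Python sketch of ELIM as described above. The encoding (examples as 0/1 tuples, a literal as an (index, sign) pair) and the function names are assumptions made for the sketch, not part of the lecture.

```python
def elim(samples):
    """ELIM sketch: keep every literal consistent with all negative examples.

    samples: list of (x, y) pairs, x a tuple of 0/1 values, y the label.
    A literal is (index, positive?) and is satisfied by x when
    x[index] == 1 (positive literal) or x[index] == 0 (negated literal).
    """
    n = len(samples[0][0])
    # Start with all 2n candidate literals.
    literals = {(i, sgn) for i in range(n) for sgn in (True, False)}
    for x, y in samples:
        if y == 0:
            # Erase every literal that evaluates to 1 on a negative example.
            literals -= {(i, sgn) for (i, sgn) in literals if (x[i] == 1) == sgn}
    return literals


def predict_or(literals, x):
    """Hypothesis h: OR of the surviving literals."""
    return int(any((x[i] == 1) == sgn for (i, sgn) in literals))
```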
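And a sketch of the Occam step just described: run the greedy set-cover procedure over the sets S_z of positive examples satisfied by each surviving literal z. By the greedy analysis above this picks at most k ln m literals whenever some k literals of LIT classify the sample correctly. It reuses `elim` from the sketch above; again, all names are illustrative.

```python
def greedy_cover(sets):
    """Greedy set cover: repeatedly take the set covering most uncovered elements.

    sets: dict mapping a key (here: a literal) to the set of elements it covers.
    Returns the list of chosen keys.
    """
    uncovered = set().union(*sets.values()) if sets else set()
    chosen = []
    while uncovered:
        best = max(sets, key=lambda z: len(sets[z] & uncovered))
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


def occam_or(samples):
    """Learn a short OR: ELIM handles T-, greedy set cover handles T+."""
    lit = elim(samples)                      # literals consistent with T- (sketch above)
    positives = [i for i, (x, y) in enumerate(samples) if y == 1]
    # S_z = indices of positive examples that literal z satisfies.
    s = {z: {i for i in positives if (samples[i][0][z[0]] == 1) == z[1]}
         for z in lit}
    return greedy_cover(s)                   # at most k ln m literals, by the analysis
```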
k-DNF
• Definition:
  – a disjunction of terms, each of at most k literals
• Term: T = x3 ∧ x1 ∧ x5
• DNF: T1 ∨ T2 ∨ T3 ∨ T4
• Example: (x1 ∧ x4 ∧ x3) ∨ (x2 ∧ x5 ∧ x6) ∨ (x7 ∧ x4 ∧ x9)

Learning k-DNF
• Extended input:
  – For each AND of at most k literals define a "new" input T.
  – Example: T = x3 ∧ x1 ∧ x5
  – Number of new inputs: at most (2n)^k
  – Can compute the new inputs easily in time k(2n)^k.
  – The k-DNF is an OR over the new inputs.
  – Run the ELIM algorithm over the new inputs.
• Sample size: O((2n)^k/ε + (1/ε) ln(1/δ))
• Running time: same.

Learning Decision Lists
• Definition:
  – (Diagram: a decision list — a chain of literal tests x4, x7, x1, each branching either to an output in {+1, −1} or to the next test.)

Learning Decision Lists
• Similar to ELIM.
• Input: a sample S of size m.
• While S is not empty:
  – For a literal z build Tz = {x | z satisfies x}.
  – Find a Tz whose examples all have the same classification.
  – Add z to the decision list.
  – Update S = S − Tz.

DL algorithm: correctness
• The output decision list is consistent.
• Number of decision lists:
  – Length ≤ n + 1
  – Node: 2n literals
  – Leaf: 2 values
  – Total bound: (2 · 2n)^(n+1)
• Sample size:
  – m = O((n log n)/ε + (1/ε) ln(1/δ))

k-DL
• Each node is a conjunction of at most k literals.
• Includes k-DNF (and k-CNF).
• (Diagram: a k-DL — a chain of conjunction tests x4∧x2, x3∧x1, x5∧x7, each branching either to an output in {+1, −1} or to the next test.)

Learning k-DL
• Extended input:
  – For each AND of at most k literals define a "new" input.
  – Example: T = x3 ∧ x1 ∧ x5
  – Number of new inputs: at most (2n)^k
  – Can compute the new inputs easily in time k(2n)^k.
  – The k-DL is a DL over the new inputs.
  – Run the DL algorithm over the new inputs.
• Sample size
• Running time
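A minimal sketch of the decision-list learner above, under the same assumed encoding (0/1 example tuples, literals as (index, sign) pairs); names are illustrative.

```python
def learn_decision_list(samples):
    """Greedy DL learner sketch: repeatedly find a literal whose satisfying
    examples all share one label, emit that rule, and remove them from S.

    samples: list of (x, y) with x a tuple of 0/1 values and y in {0, 1}.
    Returns an ordered list of ((index, positive?), label) rules.
    """
    n = len(samples[0][0])
    literals = [(i, sgn) for i in range(n) for sgn in (True, False)]
    rest = list(samples)
    rules = []
    while rest:
        for (i, sgn) in literals:
            t_z = [(x, y) for (x, y) in rest if (x[i] == 1) == sgn]   # T_z
            if t_z and len({y for _, y in t_z}) == 1:
                rules.append(((i, sgn), t_z[0][1]))                   # add z with its label
                rest = [(x, y) for (x, y) in rest if (x[i] == 1) != sgn]  # S = S - T_z
                break
        else:
            raise ValueError("no single-literal decision list is consistent with the sample")
    return rules


def predict_dl(rules, x, default=0):
    """Scan the rules in order; the first satisfied literal decides the label."""
    for (i, sgn), label in rules:
        if (x[i] == 1) == sgn:
            return label
    return default
```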
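The "extended input" step used for both k-DNF (run ELIM on it) and k-DL (run the DL learner on it) can be sketched as a feature expansion. The helper below is an illustration under the same assumed encoding, not the lecture's code; it enumerates conjunctions of at most k of the 2n literals, at most (2n)^k of them.

```python
from itertools import combinations


def expand_terms(x, k):
    """Map an example x (tuple of 0/1) to the values of all conjunctions of
    at most k literals: the "new inputs" of the extended instance space.
    (Contradictory pairs such as xi AND not-xi are included; they are always 0
    and harmless for the sketch.)
    """
    n = len(x)
    literals = [(i, sgn) for i in range(n) for sgn in (True, False)]
    lit_val = {(i, sgn): int((x[i] == 1) == sgn) for (i, sgn) in literals}
    new_x = []
    for size in range(1, k + 1):
        for term in combinations(literals, size):
            # A term (AND of literals) is 1 iff every literal in it is 1.
            new_x.append(int(all(lit_val[z] for z in term)))
    return tuple(new_x)
```

Running ELIM on the expanded examples learns a k-DNF; running the decision-list learner on them learns a k-DL.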
Open Problems
• Attribute-efficient learning:
  – Decision lists: very limited results
  – Parity functions: negative?
  – k-DNF and k-DL

Decision Trees
• (Diagram: a small decision tree — the root tests x1; one branch is a +1 leaf, the other tests x6 and leads to +1 and −1 leaves.)

Learning Decision Trees Using DL
• Consider a decision tree T of size r.
• Theorem:
  – There exists a log(r+1)-DL L that computes T.
• Claim: there exists a leaf in T of depth ≤ log(r+1).
• Learn a decision tree using a decision list.
• Running time: n^(log s)
  – n: number of attributes
  – s: tree size

Decision Trees
• (Diagram: a decision tree with threshold predicates — the root tests x1 > 5; one branch is a +1 leaf, the other tests x6 > 2 and leads to +1 and −1 leaves.)

Decision Trees: Basic Setup
• Basic class of hypotheses H.
• Input: sample of examples.
• Output: decision tree
  – Each internal node: a predicate from H
  – Each leaf: a classification value
• Goal (Occam's razor):
  – Small decision tree
  – Classifies all (most) examples correctly

Decision Trees: Why?
• Efficient algorithms:
  – Construction
  – Classification
• Performance: comparable to other methods
• Software packages:
  – CART
  – C4.5 and C5

Decision Trees: This Lecture
• Algorithms for constructing DTs
• A theoretical justification
  – using boosting
• Future lecture:
  – DT pruning

Decision Tree Algorithm: Outline
• A natural recursive procedure.
• Decide a predicate h at the root.
• Split the data using h.
• Build the right subtree (for h(x)=1).
• Build the left subtree (for h(x)=0).
• Running time:
  – T(s) = O(s) + T(s+) + T(s−) = O(s log s)
  – s = tree size

DT: Selecting a Predicate
• Basic setting: split a node v with a predicate h
  – At v: Pr[f=1] = q
  – Pr[h=0] = u, Pr[h=1] = 1−u
  – v1 (the h=0 child): Pr[f=1 | h=0] = p
  – v2 (the h=1 child): Pr[f=1 | h=1] = r
• Clearly: q = up + (1−u)r

Potential function: setting
• Compare predicates using a potential function.
  – Inputs: q, u, p, r
  – Output: value
• Node dependent:
  – For each node and predicate assign a value.
  – Given a split: u·val(v1) + (1−u)·val(v2)
  – For a tree: weighted sum over the leaves.

PF: classification error
• Let val(v) = min{q, 1−q}
  – the classification error.
• The average potential only drops.
• Termination:
  – when the average is zero:
  – perfect classification.

PF: classification error
• Example split:
  – q = Pr[f=1] = 0.8
  – u = Pr[h=0] = 0.5, 1−u = Pr[h=1] = 0.5
  – p = Pr[f=1 | h=0] = 0.6
  – r = Pr[f=1 | h=1] = 1
• Is this a good split?
• Initial error: 0.2
• After the split: 0.4·(1/2) + 0·(1/2) = 0.2

Potential Function: requirements
• When zero: perfect classification.
• Strictly concave.

Potential Function: requirements
• Every change is an improvement:
  – u·val(p) + (1−u)·val(r) ≤ val(q)
• (Diagram: by concavity, the chord from (p, val(p)) to (r, val(r)) lies below val at q = up + (1−u)r.)

Potential Functions: Candidates
• Potential functions:
  – val(q) = Gini(q) = 2q(1−q) — CART
  – val(q) = entropy(q) = −q log q − (1−q) log(1−q) — C4.5
  – val(q) = sqrt(2q(1−q))
• Assumptions:
  – Symmetric: val(q) = val(1−q)
  – Concave
  – val(0) = val(1) = 0 and val(1/2) = 1

DT: Construction Algorithm
Procedure DT(S), S a sample (sketched in code at the end of this section):
• If all the examples in S have the same classification b:
  – Create a leaf of value b and return.
• For each h compute val(h,S):
  – val(h,S) = u_h·val(p_h) + (1−u_h)·val(r_h)
• Let h' = arg min_h val(h,S).
• Split S using h' into S0 and S1.
• Recursively invoke DT(S0) and DT(S1).

DT: Analysis
• Potential function:
  – val(T) = Σ_{v leaf of T} Pr[v]·val(q_v)
• For simplicity: use the true probabilities.
• Bounding the classification error:
  – error(T) ≤ val(T)
  – study how fast val(T) drops
• Given a tree T, define T(l,h):
  – l: a leaf of T
  – h: a predicate
  – T(l,h): T with the leaf l replaced by a split on h.

Top-Down algorithm
• Input: s = size; H = predicates; val(); T0 = the single-leaf tree
• For t from 0 to s−1 do:
  – Let (l,h) = arg max_(l,h) {val(Tt) − val(Tt(l,h))}
  – Tt+1 = Tt(l,h)

Theoretical Analysis
• Assume H satisfies the weak learning hypothesis:
  – for every distribution D there is an h ∈ H s.t. error(h) < 1/2 − γ.
• Show that in every step
  – there is a significant drop in val(T).
• The results are weaker than AdaBoost's
  – but the algorithm was never intended for that!
• Use weak learning:
  – show a large drop in val(T) at each step.
• Modify the initial distribution to be unbiased.

Theoretical Analysis
• Let val(q) = 2q(1−q).
• Local drop at a node: at least 16γ²·[q(1−q)]².
• Claim: at every step t there is a leaf l s.t.:
  – Pr[l] ≥ ε_t/(2t)
  – error(l) = min{q_l, 1−q_l} ≥ ε_t/2
  – where ε_t is the error at stage t.
• Proof!

Theoretical Analysis
• Drop at time t:
  – at least Pr[l]·γ²·[q_l(1−q_l)]², which is Ω(γ²·ε_t³ / t).
• For the Gini index:
  – val(q) = 2q(1−q)
  – q ≥ q(1−q) = val(q)/2
• Drop at least Ω(γ²·[val(Tt)]³ / t).

Theoretical Analysis
• Need to solve for when val(Tk) < ε.
• Bound k.
• Time: exp{O(1/γ² · 1/ε²)}

Something to think about
• AdaBoost: very good bounds
• DT with the Gini index: exponential
• Comparable results in practice
• How can it be?
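To make the recursive DT(S) procedure above concrete, here is a minimal Python sketch with the Gini potential. `predicates` stands in for the hypothesis class H (a list of boolean functions of x), and labels are assumed to be 0/1; it is an illustration of the scheme, not the CART/C4.5 code.

```python
def gini(q):
    """Gini potential val(q) = 2q(1-q)."""
    return 2.0 * q * (1.0 - q)


def split_value(samples, h, val=gini):
    """u*val(p) + (1-u)*val(r) for the split of the sample induced by h."""
    left = [(x, y) for (x, y) in samples if not h(x)]
    right = [(x, y) for (x, y) in samples if h(x)]
    if not left or not right:
        return float("inf")                  # degenerate split: never prefer it
    u = len(left) / len(samples)
    p = sum(y for _, y in left) / len(left)
    r = sum(y for _, y in right) / len(right)
    return u * val(p) + (1 - u) * val(r)


def build_dt(samples, predicates, val=gini):
    """Procedure DT(S): greedily pick the predicate minimizing the potential."""
    labels = {y for _, y in samples}
    if len(labels) == 1:                     # all examples share classification b
        return ("leaf", labels.pop())
    h = min(predicates, key=lambda g: split_value(samples, g, val))
    s0 = [(x, y) for (x, y) in samples if not h(x)]
    s1 = [(x, y) for (x, y) in samples if h(x)]
    if not s0 or not s1:                     # no predicate separates S: majority leaf
        q = sum(y for _, y in samples) / len(samples)
        return ("leaf", int(q >= 0.5))
    return ("node", h,
            build_dt(s0, predicates, val),
            build_dt(s1, predicates, val))
```

Swapping `gini` for the entropy or square-root candidate changes only the `val` argument.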
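Finally, a quick numeric check of the split example from the "PF: classification error" slide (q = 0.8, u = 0.5, p = 0.6, r = 1): the classification-error potential shows no drop, while a strictly concave potential such as Gini does — which is why the analysis above needs concavity.

```python
def classification_error(q):
    return min(q, 1 - q)


def gini(q):
    return 2 * q * (1 - q)


q, u, p, r = 0.8, 0.5, 0.6, 1.0          # the example split; note q = u*p + (1-u)*r

for val in (classification_error, gini):
    before = val(q)
    after = u * val(p) + (1 - u) * val(r)
    print(val.__name__, round(before, 3), "->", round(after, 3))
# classification_error: 0.2  -> 0.2   (no progress)
# gini:                 0.32 -> 0.24  (strict drop)
```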