Decision Trees and more!

Learning OR with few attributes
• Target function: OR of k literals
• Goal: learn in time
  – polynomial in k and log n
  – ε and δ constant
• ELIM makes "slow" progress
  – might disqualify only one literal per round
  – might remain with O(n) candidate literals

ELIM: Algorithm for learning OR
• Keep a list of all candidate literals.
• For every example whose classification is 0:
  – Erase all the literals that are 1.
• Correctness:
  – Our hypothesis h: an OR of our set of literals.
  – Our set of literals includes the target OR's literals.
  – Every time h predicts zero: we are correct.
• Sample size:
  – m > (1/ε) ln(3^n/δ) = O(n/ε + (1/ε) ln(1/δ))

Set Cover - Definition
• Input: S1, …, St with Si ⊆ U
• Output: Si1, …, Sik such that Si1 ∪ … ∪ Sik = U
• Question: Are there k sets that cover U?
• NP-complete

Set Cover: Greedy algorithm
• j = 0; Uj = U; C = ∅
• While Uj ≠ ∅:
  – Let Si be arg maxi |Si ∩ Uj|
  – Add Si to C
  – Let Uj+1 = Uj − Si
  – j = j + 1

Set Cover: Greedy Analysis
• At termination, C is a cover.
• Assume there is a cover C* of size k.
• C* is a cover for every Uj.
• Some S in C* covers at least |Uj|/k elements of Uj.
• Analysis of Uj: |Uj+1| ≤ |Uj| − |Uj|/k
• Solving the recursion:
  – Number of sets: j ≤ k ln(|U| + 1)

Building an Occam algorithm
• Given a sample T of size m:
  – Run ELIM on T.
  – Let LIT be the set of remaining literals.
  – Assume there exist k literals in LIT that classify all of the sample T correctly.
• Negative examples T−:
  – any subset of LIT classifies T− correctly.

Building an Occam algorithm
• Positive examples T+:
  – Search for a small subset of LIT which classifies T+ correctly.
  – For a literal z build Sz = {x | z satisfies x}.
  – Our assumption: there are k sets that cover T+.
  – Greedy finds at most k ln m sets that cover T+.
• Output h = OR of the at most k ln m literals.
• size(h) < k ln m · log(2n)
• Sample size: m = O(k log n · log(k log n)) (for constant ε and δ)
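As a concrete illustration, here is a minimal Python sketch of ELIM as described above. The encoding (examples as 0/1 tuples, a literal as an (index, sign) pair) and the function names are assumptions made for the sketch, not part of the lecture.

```python
def elim(samples):
    """ELIM sketch: keep every literal consistent with all negative examples.

    samples: list of (x, y) pairs, x a tuple of 0/1 values, y the label.
    A literal is (index, positive?) and is satisfied by x when
    x[index] == 1 (positive literal) or x[index] == 0 (negated literal).
    """
    n = len(samples[0][0])
    # Start with all 2n candidate literals.
    literals = {(i, sgn) for i in range(n) for sgn in (True, False)}
    for x, y in samples:
        if y == 0:
            # Erase every literal that evaluates to 1 on a negative example.
            literals -= {(i, sgn) for (i, sgn) in literals if (x[i] == 1) == sgn}
    return literals


def predict_or(literals, x):
    """Hypothesis h: OR of the surviving literals."""
    return int(any((x[i] == 1) == sgn for (i, sgn) in literals))
```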
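And a sketch of the Occam step just described: run the greedy set-cover procedure over the sets S_z of positive examples satisfied by each surviving literal z. By the greedy analysis above this picks at most k ln m literals whenever some k literals of LIT classify the sample correctly. It reuses `elim` from the sketch above; again, all names are illustrative.

```python
def greedy_cover(sets):
    """Greedy set cover: repeatedly take the set covering most uncovered elements.

    sets: dict mapping a key (here: a literal) to the set of elements it covers.
    Returns the list of chosen keys.
    """
    uncovered = set().union(*sets.values()) if sets else set()
    chosen = []
    while uncovered:
        best = max(sets, key=lambda z: len(sets[z] & uncovered))
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


def occam_or(samples):
    """Learn a short OR: ELIM handles T-, greedy set cover handles T+."""
    lit = elim(samples)                      # literals consistent with T- (sketch above)
    positives = [i for i, (x, y) in enumerate(samples) if y == 1]
    # S_z = indices of positive examples that literal z satisfies.
    s = {z: {i for i in positives if (samples[i][0][z[0]] == 1) == z[1]}
         for z in lit}
    return greedy_cover(s)                   # at most k ln m literals, by the analysis
```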
k-DNF
• Definition:
  – a disjunction of terms, each of at most k literals
• Term: T = x3 ∧ x1 ∧ x5
• DNF: T1 ∨ T2 ∨ T3 ∨ T4
• Example: (x1 ∧ x4 ∧ x3) ∨ (x2 ∧ x5 ∧ x6) ∨ (x7 ∧ x4 ∧ x9)

Learning k-DNF
• Extended input:
  – For each AND of at most k literals define a "new" input T.
  – Example: T = x3 ∧ x1 ∧ x5
  – Number of new inputs: at most (2n)^k
  – Can compute the new inputs easily in time k(2n)^k.
  – The k-DNF is an OR over the new inputs.
  – Run the ELIM algorithm over the new inputs.
• Sample size: O((2n)^k/ε + (1/ε) ln(1/δ))
• Running time: same.

Learning Decision Lists
• Definition:
  – (Diagram: a decision list — a chain of literal tests x4, x7, x1, each branching either to an output in {+1, −1} or to the next test.)

Learning Decision Lists
• Similar to ELIM.
• Input: a sample S of size m.
• While S is not empty:
  – For a literal z build Tz = {x | z satisfies x}.
  – Find a Tz whose examples all have the same classification.
  – Add z to the decision list.
  – Update S = S − Tz.

DL algorithm: correctness
• The output decision list is consistent.
• Number of decision lists:
  – Length ≤ n + 1
  – Node: 2n literals
  – Leaf: 2 values
  – Total bound: (2 · 2n)^(n+1)
• Sample size:
  – m = O((n log n)/ε + (1/ε) ln(1/δ))

k-DL
• Each node is a conjunction of at most k literals.
• Includes k-DNF (and k-CNF).
• (Diagram: a k-DL — a chain of conjunction tests x4∧x2, x3∧x1, x5∧x7, each branching either to an output in {+1, −1} or to the next test.)

Learning k-DL
• Extended input:
  – For each AND of at most k literals define a "new" input.
  – Example: T = x3 ∧ x1 ∧ x5
  – Number of new inputs: at most (2n)^k
  – Can compute the new inputs easily in time k(2n)^k.
  – The k-DL is a DL over the new inputs.
  – Run the DL algorithm over the new inputs.
• Sample size
• Running time
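A minimal sketch of the decision-list learner above, under the same assumed encoding (0/1 example tuples, literals as (index, sign) pairs); names are illustrative.

```python
def learn_decision_list(samples):
    """Greedy DL learner sketch: repeatedly find a literal whose satisfying
    examples all share one label, emit that rule, and remove them from S.

    samples: list of (x, y) with x a tuple of 0/1 values and y in {0, 1}.
    Returns an ordered list of ((index, positive?), label) rules.
    """
    n = len(samples[0][0])
    literals = [(i, sgn) for i in range(n) for sgn in (True, False)]
    rest = list(samples)
    rules = []
    while rest:
        for (i, sgn) in literals:
            t_z = [(x, y) for (x, y) in rest if (x[i] == 1) == sgn]   # T_z
            if t_z and len({y for _, y in t_z}) == 1:
                rules.append(((i, sgn), t_z[0][1]))                   # add z with its label
                rest = [(x, y) for (x, y) in rest if (x[i] == 1) != sgn]  # S = S - T_z
                break
        else:
            raise ValueError("no single-literal decision list is consistent with the sample")
    return rules


def predict_dl(rules, x, default=0):
    """Scan the rules in order; the first satisfied literal decides the label."""
    for (i, sgn), label in rules:
        if (x[i] == 1) == sgn:
            return label
    return default
```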
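The "extended input" step used for both k-DNF (run ELIM on it) and k-DL (run the DL learner on it) can be sketched as a feature expansion. The helper below is an illustration under the same assumed encoding, not the lecture's code; it enumerates conjunctions of at most k of the 2n literals, at most (2n)^k of them.

```python
from itertools import combinations


def expand_terms(x, k):
    """Map an example x (tuple of 0/1) to the values of all conjunctions of
    at most k literals: the "new inputs" of the extended instance space.
    (Contradictory pairs such as xi AND not-xi are included; they are always 0
    and harmless for the sketch.)
    """
    n = len(x)
    literals = [(i, sgn) for i in range(n) for sgn in (True, False)]
    lit_val = {(i, sgn): int((x[i] == 1) == sgn) for (i, sgn) in literals}
    new_x = []
    for size in range(1, k + 1):
        for term in combinations(literals, size):
            # A term (AND of literals) is 1 iff every literal in it is 1.
            new_x.append(int(all(lit_val[z] for z in term)))
    return tuple(new_x)
```

Running ELIM on the expanded examples learns a k-DNF; running the decision-list learner on them learns a k-DL.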
Open Problems
• Attribute-efficient learning:
  – Decision lists: very limited results
  – Parity functions: negative?
  – k-DNF and k-DL

Decision Trees
• (Diagram: a small decision tree — the root tests x1; one branch is a +1 leaf, the other tests x6 and leads to +1 and −1 leaves.)

Learning Decision Trees Using DL
• Consider a decision tree T of size r.
• Theorem:
  – There exists a log(r+1)-DL L that computes T.
• Claim: there exists a leaf in T of depth ≤ log(r+1).
• Learn a decision tree using a decision list.
• Running time: n^(log s)
  – n: number of attributes
  – s: tree size

Decision Trees
• (Diagram: a decision tree with threshold predicates — the root tests x1 > 5; one branch is a +1 leaf, the other tests x6 > 2 and leads to +1 and −1 leaves.)

Decision Trees: Basic Setup
• Basic class of hypotheses H.
• Input: sample of examples.
• Output: decision tree
  – Each internal node: a predicate from H
  – Each leaf: a classification value
• Goal (Occam's razor):
  – Small decision tree
  – Classifies all (most) examples correctly

Decision Trees: Why?
• Efficient algorithms:
  – Construction
  – Classification
• Performance: comparable to other methods
• Software packages:
  – CART
  – C4.5 and C5

Decision Trees: This Lecture
• Algorithms for constructing DTs
• A theoretical justification
  – using boosting
• Future lecture:
  – DT pruning

Decision Tree Algorithm: Outline
• A natural recursive procedure.
• Decide a predicate h at the root.
• Split the data using h.
• Build the right subtree (for h(x)=1).
• Build the left subtree (for h(x)=0).
• Running time:
  – T(s) = O(s) + T(s+) + T(s−) = O(s log s)
  – s = tree size

DT: Selecting a Predicate
• Basic setting: split a node v with a predicate h
  – At v: Pr[f=1] = q
  – Pr[h=0] = u, Pr[h=1] = 1−u
  – v1 (the h=0 child): Pr[f=1 | h=0] = p
  – v2 (the h=1 child): Pr[f=1 | h=1] = r
• Clearly: q = up + (1−u)r

Potential function: setting
• Compare predicates using a potential function.
  – Inputs: q, u, p, r
  – Output: value
• Node dependent:
  – For each node and predicate assign a value.
  – Given a split: u·val(v1) + (1−u)·val(v2)
  – For a tree: weighted sum over the leaves.

PF: classification error
• Let val(v) = min{q, 1−q}
  – the classification error.
• The average potential only drops.
• Termination:
  – when the average is zero:
  – perfect classification.

PF: classification error
• Example split:
  – q = Pr[f=1] = 0.8
  – u = Pr[h=0] = 0.5, 1−u = Pr[h=1] = 0.5
  – p = Pr[f=1 | h=0] = 0.6
  – r = Pr[f=1 | h=1] = 1
• Is this a good split?
• Initial error: 0.2
• After the split: 0.4·(1/2) + 0·(1/2) = 0.2

Potential Function: requirements
• When zero: perfect classification.
• Strictly concave.

Potential Function: requirements
• Every change is an improvement:
  – u·val(p) + (1−u)·val(r) ≤ val(q)
• (Diagram: by concavity, the chord from (p, val(p)) to (r, val(r)) lies below val at q = up + (1−u)r.)

Potential Functions: Candidates
• Potential functions:
  – val(q) = Gini(q) = 2q(1−q) — CART
  – val(q) = entropy(q) = −q log q − (1−q) log(1−q) — C4.5
  – val(q) = sqrt(2q(1−q))
• Assumptions:
  – Symmetric: val(q) = val(1−q)
  – Concave
  – val(0) = val(1) = 0 and val(1/2) = 1

DT: Construction Algorithm
Procedure DT(S), S a sample (sketched in code at the end of this section):
• If all the examples in S have the same classification b:
  – Create a leaf of value b and return.
• For each h compute val(h,S):
  – val(h,S) = u_h·val(p_h) + (1−u_h)·val(r_h)
• Let h' = arg min_h val(h,S).
• Split S using h' into S0 and S1.
• Recursively invoke DT(S0) and DT(S1).

DT: Analysis
• Potential function:
  – val(T) = Σ_{v leaf of T} Pr[v]·val(q_v)
• For simplicity: use the true probabilities.
• Bounding the classification error:
  – error(T) ≤ val(T)
  – study how fast val(T) drops
• Given a tree T, define T(l,h):
  – l: a leaf of T
  – h: a predicate
  – T(l,h): T with the leaf l replaced by a split on h.

Top-Down algorithm
• Input: s = size; H = predicates; val(); T0 = the single-leaf tree
• For t from 0 to s−1 do:
  – Let (l,h) = arg max_(l,h) {val(Tt) − val(Tt(l,h))}
  – Tt+1 = Tt(l,h)

Theoretical Analysis
• Assume H satisfies the weak learning hypothesis:
  – for every distribution D there is an h ∈ H s.t. error(h) < 1/2 − γ.
• Show that in every step
  – there is a significant drop in val(T).
• The results are weaker than AdaBoost's
  – but the algorithm was never intended for that!
• Use weak learning:
  – show a large drop in val(T) at each step.
• Modify the initial distribution to be unbiased.

Theoretical Analysis
• Let val(q) = 2q(1−q).
• Local drop at a node: at least 16γ²·[q(1−q)]².
• Claim: at every step t there is a leaf l s.t.:
  – Pr[l] ≥ ε_t/(2t)
  – error(l) = min{q_l, 1−q_l} ≥ ε_t/2
  – where ε_t is the error at stage t.
• Proof!

Theoretical Analysis
• Drop at time t:
  – at least Pr[l]·γ²·[q_l(1−q_l)]², which is Ω(γ²·ε_t³ / t).
• For the Gini index:
  – val(q) = 2q(1−q)
  – q ≥ q(1−q) = val(q)/2
• Drop at least Ω(γ²·[val(Tt)]³ / t).

Theoretical Analysis
• Need to solve for when val(Tk) < ε.
• Bound k.
• Time: exp{O(1/γ² · 1/ε²)}

Something to think about
• AdaBoost: very good bounds
• DT with the Gini index: exponential
• Comparable results in practice
• How can it be?
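To make the recursive DT(S) procedure above concrete, here is a minimal Python sketch with the Gini potential. `predicates` stands in for the hypothesis class H (a list of boolean functions of x), and labels are assumed to be 0/1; it is an illustration of the scheme, not the CART/C4.5 code.

```python
def gini(q):
    """Gini potential val(q) = 2q(1-q)."""
    return 2.0 * q * (1.0 - q)


def split_value(samples, h, val=gini):
    """u*val(p) + (1-u)*val(r) for the split of the sample induced by h."""
    left = [(x, y) for (x, y) in samples if not h(x)]
    right = [(x, y) for (x, y) in samples if h(x)]
    if not left or not right:
        return float("inf")                  # degenerate split: never prefer it
    u = len(left) / len(samples)
    p = sum(y for _, y in left) / len(left)
    r = sum(y for _, y in right) / len(right)
    return u * val(p) + (1 - u) * val(r)


def build_dt(samples, predicates, val=gini):
    """Procedure DT(S): greedily pick the predicate minimizing the potential."""
    labels = {y for _, y in samples}
    if len(labels) == 1:                     # all examples share classification b
        return ("leaf", labels.pop())
    h = min(predicates, key=lambda g: split_value(samples, g, val))
    s0 = [(x, y) for (x, y) in samples if not h(x)]
    s1 = [(x, y) for (x, y) in samples if h(x)]
    if not s0 or not s1:                     # no predicate separates S: majority leaf
        q = sum(y for _, y in samples) / len(samples)
        return ("leaf", int(q >= 0.5))
    return ("node", h,
            build_dt(s0, predicates, val),
            build_dt(s1, predicates, val))
```

Swapping `gini` for the entropy or square-root candidate changes only the `val` argument.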
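Finally, a quick numeric check of the split example from the "PF: classification error" slide (q = 0.8, u = 0.5, p = 0.6, r = 1): the classification-error potential shows no drop, while a strictly concave potential such as Gini does — which is why the analysis above needs concavity.

```python
def classification_error(q):
    return min(q, 1 - q)


def gini(q):
    return 2 * q * (1 - q)


q, u, p, r = 0.8, 0.5, 0.6, 1.0          # the example split; note q = u*p + (1-u)*r

for val in (classification_error, gini):
    before = val(q)
    after = u * val(p) + (1 - u) * val(r)
    print(val.__name__, round(before, 3), "->", round(after, 3))
# classification_error: 0.2  -> 0.2   (no progress)
# gini:                 0.32 -> 0.24  (strict drop)
```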