Randomized Variable Elimination
David J. Stracuzzi and Paul E. Utgoff

Agenda
• Background
• Filter and wrapper methods
• Randomized Variable Elimination
• Cost function
• RVE algorithm when r is known (RVE)
• RVE algorithm when r is not known (RVErS)
• Results
• Questions

Variable Selection Problem
• Choosing the relevant attributes from a set of attributes.
• Producing the subset of a large set of input variables that best predicts the target function.
• A forward selection algorithm starts with an empty set and searches for variables to add.
• A backward selection algorithm starts with the entire set of variables and keeps removing irrelevant variables.
• In some cases a forward selection algorithm also removes variables, in order to recover from previous poor selections.
• Caruana and Freitag (1994) experimented with greedy search methods and found that allowing the search to both add and remove variables outperforms simple forward and backward searches.
• Two broad families: filter and wrapper methods for variable selection.

Filter methods
• Use statistical measures to evaluate the quality of variable subsets.
• A subset of variables is evaluated with respect to a specific quality measure.
• Statistical evaluation of variables requires very little computation compared with running the learning algorithm.
• FOCUS (Almuallim and Dietterich, 1991) searches for the smallest subset that completely discriminates between the target classes.
• Relief (Kira and Rendell, 1992) ranks variables using a distance-based relevance measure.
• Variables are evaluated independently, not in the context of the learning problem.

Wrapper methods
• Use the performance of the learning algorithm to evaluate the quality of a subset of input variables.
• The learning algorithm is executed on the candidate variable set, and the accuracy of the resulting hypothesis is tested.
• Advantage: because wrapper methods evaluate variables in the context of the learning problem, they tend to outperform filter methods.
• Disadvantage: the cost of repeatedly executing the learning algorithm can become problematic.
• John, Kohavi, and Pfleger (1994) coined the term "wrapper", but the technique was used earlier (Devijver and Kittler, 1982).

Randomized Variable Elimination
• Falls into the category of wrapper methods.
• First, a hypothesis is produced for the entire set of n variables.
• A subset is formed by randomly selecting k variables.
• A hypothesis is then produced for the remaining n − k variables.
• The accuracies of the two hypotheses are compared: removing any relevant variable should cause an immediate decline in performance.
• Uses a cost function to balance the cost of successive failed selections against the cost of running the learning algorithm many times.

The Cost Function

Probability of selecting k variables
• The probability of successfully selecting k irrelevant variables at random is
      p(n, r, k) = ∏_{i=0}^{k−1} (n − r − i) / (n − i)
  where n is the number of remaining variables and r is the number of relevant variables.

Expected number of failures
• The expected number of consecutive failures before a success at selecting k irrelevant variables is
      E(n, r, k) = (1 − p(n, r, k)) / p(n, r, k)
• This is the expected number of consecutive trials in which at least one of the r relevant variables is randomly selected along with the irrelevant ones.
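The two quantities above translate directly into code. Below is a minimal Python sketch written from these formulas; the function names prob_success and expected_failures and the example numbers are mine, not from the paper.

    def prob_success(n, r, k):
        """p(n, r, k): probability that k variables drawn at random from the
        n remaining variables (r of them relevant) are all irrelevant."""
        p = 1.0
        for i in range(k):
            p *= (n - r - i) / (n - i)
        return p

    def expected_failures(n, r, k):
        """E(n, r, k): expected number of consecutive failed selections
        before a success, i.e. (1 - p) / p for a geometric trial."""
        p = prob_success(n, r, k)
        return (1.0 - p) / p

    # Illustrative values: 100 remaining variables, 10 relevant, k = 5.
    print(prob_success(100, 10, 5))       # ≈ 0.58
    print(expected_failures(100, 10, 5))  # ≈ 0.71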
Cost of removing k variables
• The expected cost of successfully removing k variables from n remaining, given r relevant variables, is
      I(n, r, k) = E(n, r, k) · M(L, n − k) + M(L, n − k) = M(L, n − k) · (E(n, r, k) + 1)
  where M(L, n) is an upper bound on the cost of running algorithm L on n inputs.

Optimal cost of removing irrelevant variables
• The optimal expected cost of removing the irrelevant variables, given n remaining and r relevant, is
      I_sum(n, r) = min_k ( I(n, r, k) + I_sum(n − k, r) )

Optimal value for k
• The optimal value k_opt(n, r) is computed as
      k_opt(n, r) = argmin_k ( I(n, r, k) + I_sum(n − k, r) )
• It is the value of k for which the expected cost of removing variables is minimized.

Algorithms

Algorithm for computing k and cost values
    Given: L, N, r
    Isum[r+1…N] ← 0
    kopt[r+1…N] ← 0
    for i ← r+1 to N do
        bestCost ← ∞
        for k ← 1 to i − r do
            temp ← I(i, r, k) + Isum[i − k]
            if temp < bestCost then
                bestCost ← temp
                bestK ← k
        Isum[i] ← bestCost
        kopt[i] ← bestK

Randomized Variable Elimination (RVE) when r is known
    Given: L, n, r, tolerance
    compute tables Isum(i, r) and kopt(i, r)
    h ← hypothesis produced by L on n inputs
    while n > r do
        k ← kopt(n, r)
        select k variables at random and remove them
        h' ← hypothesis produced by L on n − k inputs
        if e(h') − e(h) ≤ tolerance then
            n ← n − k
            h ← h'
        else
            replace the selected k variables

RVE example
• Plot of the expected cost of running RVE, I_sum(N, r = 10), along with the cost of removing inputs individually and the estimated number of updates M(L, n).
• Here L learns a Boolean function using a single perceptron unit.

Randomized Variable Elimination including a search for r (RVErS)
    Given: L, c1, c2, n, rmax, rmin, tolerance
    compute tables Isum(i, r) and kopt(i, r) for rmin ≤ r ≤ rmax
    r ← (rmax + rmin) / 2
    success, fail ← 0
    h ← hypothesis produced by L on n inputs
    repeat
        k ← kopt(n, r)
        select k variables at random and remove them
        h' ← hypothesis produced by L on n − k inputs
        if e(h') − e(h) ≤ tolerance then
            n ← n − k
            h ← h'
            success ← success + 1
            fail ← 0
        else
            replace the selected k variables
            fail ← fail + 1
            success ← 0

RVErS (contd…)
        if n ≤ rmin then
            r, rmax, rmin ← n
        else if fail ≥ c1 · E(n, r, k) then
            rmin ← r
            r ← (rmax + rmin) / 2
            success, fail ← 0
        else if success ≥ c2 · (r − E(n, r, k)) then
            rmax ← r
            r ← (rmax + rmin) / 2
            success, fail ← 0
    until rmin ≥ rmax and fail ≥ c1 · E(n, r, k)
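The cost formulas and the table-building pseudocode above can be sketched compactly in Python. This is a sketch under assumptions, not the authors' implementation: the learner cost bound M(L, n) is passed in as a plain function M, the names cost_tables, Isum, and kopt are mine, and the quadratic bound in the example is made up for illustration.

    def prob_success(n, r, k):
        # p(n, r, k) from the cost-function slides.
        p = 1.0
        for i in range(k):
            p *= (n - r - i) / (n - i)
        return p

    def expected_failures(n, r, k):
        # E(n, r, k) = (1 - p) / p.
        p = prob_success(n, r, k)
        return (1.0 - p) / p

    def cost_tables(M, N, r):
        """Fill Isum[i] and kopt[i] for i = r+1 .. N, where M(n) is an upper
        bound on the cost of running the learner on n inputs."""
        Isum = [0.0] * (N + 1)   # Isum[i]: optimal expected cost starting from i inputs
        kopt = [0] * (N + 1)     # kopt[i]: best number of variables to remove at i inputs
        for i in range(r + 1, N + 1):
            best_cost, best_k = float("inf"), 1
            for k in range(1, i - r + 1):
                # I(i, r, k) = M(i - k) * (E(i, r, k) + 1)
                step = M(i - k) * (expected_failures(i, r, k) + 1.0)
                total = step + Isum[i - k]
                if total < best_cost:
                    best_cost, best_k = total, k
            Isum[i], kopt[i] = best_cost, best_k
        return Isum, kopt

    # Example with a made-up quadratic cost bound M(L, n) ~ n^2.
    Isum, kopt = cost_tables(lambda n: n * n, N=100, r=10)
    print(kopt[100], Isum[100])

The RVE and RVErS loops then simply look up kopt[n] at each step, remove that many randomly chosen variables, and keep or undo the removal depending on whether the new hypothesis stays within the error tolerance.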
Comparison of RVE and RVErS

Results

Variable selection results using naïve Bayes and C4.5 algorithms (LED data set)

    Learning  Selection    Iters  Subset evals  Inputs  Error (%)    Time (sec)  Search cost
    Bayes     none             –             –   150    30.3 ± 3.0         0.09       275000
              r_max = 25     127           172    22.7  26.9 ± 3.9        19         39700000
              r_max = 75     293           293    17.4  26.0 ± 3.3        50        109000000
              r_max = 150    434           434    25.6  25.9 ± 2.6        86        202000000
              k = 1          423           423    23.7  27.0 ± 2.1        85        204000000
              forward         14          2006    13    26.6 ± 2.9       141        154000000
              backward        14          1870   138    30.1 ± 2.6       667       1950000000
              filter         150           150    23.7  27.1 ± 2.1        34         84900000
    C4.5      none             –             –   150    43.9 ± 4.5         0.5          54800
              r_max = 25      85            85    51.1  42.0 ± 3.0        89         10100000
              r_max = 75     468           468    25.8  42.5 ± 4.5       363         37000000
              r_max = 150    541           541    25.2  40.8 ± 5.7       440         44800000
              k = 1          510           510    32.4  42.5 ± 2.7       439           952000
              forward          9          1286     7.8  27.0 ± 3.2       196        133000000
              backward        61          7218    90.9  43.5 ± 3.5     11481        133000000
              filter         150           150     7.1  27.3 ± 3.5       156         16900000

Variable selection results using naïve Bayes and C4.5 algorithms (soybean data set)

    Learning  Selection    Iters  Subset evals  Inputs  Error (%)    Time (sec)  Search cost
    Bayes     none             –             –    35     7.8 ± 2.4         0.02        24000
              r_max = 15     142           142    12.6   8.9 ± 4.2         5.9       6470000
              r_max = 25     135           135    11.9  10.5 ± 5.8         5.8       6260000
              r_max = 35     132           132    11.2   9.8 ± 5.1         5.8       6170000
              k = 1           88            88    12.3   9.6 ± 5.0         4.6       4670000
              forward         13           382    12.3   7.3 ± 2.9         8.9      10900000
              backward        19           472    18     7.9 ± 4.6        37.5      36300000
              filter          35            35    31.3   7.8 ± 2.6         2         1970000
    C4.5      none             –             –    35     8.6 ± 4.0         0.04         1210
              r_max = 15     118           118    16.3   9.5 ± 4.6        13.5        278000
              r_max = 25     158           158    14.7  10.1 ± 4.1        18.6        190000
              r_max = 35     139           139    16.3   9.1 ± 3.7        17.3        386000
              k = 1          117           117    16.1   9.3 ± 3.5        14.9        352000
              forward         16           435    14.8   9.1 ± 4.0        33.7        322000
              backward        18           455    19.1  10.4 ± 4.4        69         1750000
              filter          35            35    30.8   8.5 ± 3.7         3.7         60600

My implementation
• Integrate with Weka.
• Extend the NaiveBayes and J48 algorithms.
• Obtain results for some of the UCI data sets used by the authors.
• Compare the results with those reported by the authors.
• Work in progress.

RECAP

Questions

References
• H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the Ninth National Conference on Artificial Intelligence, Anaheim, CA, 1991. MIT Press.
• R. Caruana and D. Freitag. Greedy attribute selection. In Machine Learning: Proceedings of the Eleventh International Conference, New Brunswick, NJ, 1994. Morgan Kaufmann.
• K. Kira and L. Rendell. A practical approach to feature selection. In D. Sleeman and P. Edwards, editors, Machine Learning: Proceedings of the Ninth International Conference, San Mateo, CA, 1992. Morgan Kaufmann.

References (contd…)
• G. H. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In Machine Learning: Proceedings of the Eleventh International Conference, pages 121–129, New Brunswick, NJ, 1994. Morgan Kaufmann.
• P. A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice Hall International, 1982.