Randomized Variable Elimination
David J. Stracuzzi and Paul E. Utgoff

Agenda
• Background
• Filter and wrapper methods
• Randomized Variable Elimination
• Cost function
• RVE algorithm when r is known (RVE)
• RVE algorithm when r is not known (RVErS)
• Results
• Questions

Variable Selection Problem
• Choosing the relevant attributes from a set of attributes.
• Producing the subset of a large set of input variables that best predicts the target function.
• A forward selection algorithm starts with an empty set and searches for variables to add.
• A backward selection algorithm starts with the entire set of variables and keeps removing irrelevant variables.
• In some cases a forward selection algorithm also removes variables, in order to recover from previous poor selections.
• Caruana and Freitag (1994) experimented with greedy search methods and found that allowing the search to both add and remove variables outperforms simple forward and backward searches.
• Two broad families: filter and wrapper methods for variable selection.

Filter methods
• Use statistical measures to evaluate the quality of variable subsets.
• A subset of variables is evaluated with respect to a specific quality measure.
• Statistical evaluation of variables requires very little computation compared with running the learning algorithm.
• FOCUS (Almuallim and Dietterich, 1991) searches for the smallest subset that completely discriminates between the target classes.
• Relief (Kira and Rendell, 1992) ranks variables using a distance-based relevance measure.
• Variables are evaluated independently, not in the context of the learning problem.

Wrapper methods
• Use the performance of the learning algorithm to evaluate the quality of a subset of input variables.
• The learning algorithm is executed on the candidate variable set, and the accuracy of the resulting hypothesis is tested.
• Advantage: because wrapper methods evaluate variables in the context of the learning problem, they tend to outperform filter methods.
• Disadvantage: the cost of repeatedly executing the learning algorithm can become problematic.
• John, Kohavi, and Pfleger (1994) coined the term "wrapper", but the technique was used earlier (Devijver and Kittler, 1982).

Randomized Variable Elimination
• Falls into the category of wrapper methods.
• First, a hypothesis is produced for the entire set of n variables.
• A subset is formed by randomly selecting k variables.
• A hypothesis is then produced for the remaining n − k variables.
• The accuracies of the two hypotheses are compared: removing any relevant variable should cause an immediate decline in performance.
• Uses a cost function to balance the cost of successive failed selections against the cost of running the learning algorithm many times.

The Cost Function

Probability of selecting k variables
• The probability of successfully selecting k irrelevant variables at random is
      p(n, r, k) = ∏_{i=0}^{k−1} (n − r − i) / (n − i)
  where n is the number of remaining variables and r is the number of relevant variables.

Expected number of failures
• The expected number of consecutive failures before a success at selecting k irrelevant variables is
      E(n, r, k) = (1 − p(n, r, k)) / p(n, r, k)
• This is the expected number of consecutive trials in which at least one of the r relevant variables is randomly selected along with the irrelevant ones.
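The two quantities above translate directly into code. Below is a minimal Python sketch written from these formulas; the function names prob_success and expected_failures and the example numbers are mine, not from the paper.

    def prob_success(n, r, k):
        """p(n, r, k): probability that k variables drawn at random from the
        n remaining variables (r of them relevant) are all irrelevant."""
        p = 1.0
        for i in range(k):
            p *= (n - r - i) / (n - i)
        return p

    def expected_failures(n, r, k):
        """E(n, r, k): expected number of consecutive failed selections
        before a success, i.e. (1 - p) / p for a geometric trial."""
        p = prob_success(n, r, k)
        return (1.0 - p) / p

    # Illustrative values: 100 remaining variables, 10 relevant, k = 5.
    print(prob_success(100, 10, 5))       # ≈ 0.58
    print(expected_failures(100, 10, 5))  # ≈ 0.71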
Cost of removing k variables
• The expected cost of successfully removing k variables from n remaining, given r relevant variables, is
      I(n, r, k) = E(n, r, k) · M(L, n − k) + M(L, n − k) = M(L, n − k) · (E(n, r, k) + 1)
  where M(L, n) is an upper bound on the cost of running algorithm L on n inputs.

Optimal cost of removing irrelevant variables
• The optimal expected cost of removing the irrelevant variables, given n remaining and r relevant, is
      I_sum(n, r) = min_k ( I(n, r, k) + I_sum(n − k, r) )

Optimal value for k
• The optimal value k_opt(n, r) is computed as
      k_opt(n, r) = argmin_k ( I(n, r, k) + I_sum(n − k, r) )
• It is the value of k for which the expected cost of removing variables is minimized.

Algorithms

Algorithm for computing k and cost values
    Given: L, N, r
    Isum[r+1…N] ← 0
    kopt[r+1…N] ← 0
    for i ← r+1 to N do
        bestCost ← ∞
        for k ← 1 to i − r do
            temp ← I(i, r, k) + Isum[i − k]
            if temp < bestCost then
                bestCost ← temp
                bestK ← k
        Isum[i] ← bestCost
        kopt[i] ← bestK

Randomized Variable Elimination (RVE) when r is known
    Given: L, n, r, tolerance
    compute tables Isum(i, r) and kopt(i, r)
    h ← hypothesis produced by L on n inputs
    while n > r do
        k ← kopt(n, r)
        select k variables at random and remove them
        h' ← hypothesis produced by L on n − k inputs
        if e(h') − e(h) ≤ tolerance then
            n ← n − k
            h ← h'
        else
            replace the selected k variables

RVE example
• Plot of the expected cost of running RVE, I_sum(N, r = 10), along with the cost of removing inputs individually and the estimated number of updates M(L, n).
• Here L learns a Boolean function using a single perceptron unit.

Randomized Variable Elimination including a search for r (RVErS)
    Given: L, c1, c2, n, rmax, rmin, tolerance
    compute tables Isum(i, r) and kopt(i, r) for rmin ≤ r ≤ rmax
    r ← (rmax + rmin) / 2
    success, fail ← 0
    h ← hypothesis produced by L on n inputs
    repeat
        k ← kopt(n, r)
        select k variables at random and remove them
        h' ← hypothesis produced by L on n − k inputs
        if e(h') − e(h) ≤ tolerance then
            n ← n − k
            h ← h'
            success ← success + 1
            fail ← 0
        else
            replace the selected k variables
            fail ← fail + 1
            success ← 0

RVErS (contd…)
        if n ≤ rmin then
            r, rmax, rmin ← n
        else if fail ≥ c1 · E(n, r, k) then
            rmin ← r
            r ← (rmax + rmin) / 2
            success, fail ← 0
        else if success ≥ c2 · (r − E(n, r, k)) then
            rmax ← r
            r ← (rmax + rmin) / 2
            success, fail ← 0
    until rmin ≥ rmax and fail ≥ c1 · E(n, r, k)
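The cost formulas and the table-building pseudocode above can be sketched compactly in Python. This is a sketch under assumptions, not the authors' implementation: the learner cost bound M(L, n) is passed in as a plain function M, the names cost_tables, Isum, and kopt are mine, and the quadratic bound in the example is made up for illustration.

    def prob_success(n, r, k):
        # p(n, r, k) from the cost-function slides.
        p = 1.0
        for i in range(k):
            p *= (n - r - i) / (n - i)
        return p

    def expected_failures(n, r, k):
        # E(n, r, k) = (1 - p) / p.
        p = prob_success(n, r, k)
        return (1.0 - p) / p

    def cost_tables(M, N, r):
        """Fill Isum[i] and kopt[i] for i = r+1 .. N, where M(n) is an upper
        bound on the cost of running the learner on n inputs."""
        Isum = [0.0] * (N + 1)   # Isum[i]: optimal expected cost starting from i inputs
        kopt = [0] * (N + 1)     # kopt[i]: best number of variables to remove at i inputs
        for i in range(r + 1, N + 1):
            best_cost, best_k = float("inf"), 1
            for k in range(1, i - r + 1):
                # I(i, r, k) = M(i - k) * (E(i, r, k) + 1)
                step = M(i - k) * (expected_failures(i, r, k) + 1.0)
                total = step + Isum[i - k]
                if total < best_cost:
                    best_cost, best_k = total, k
            Isum[i], kopt[i] = best_cost, best_k
        return Isum, kopt

    # Example with a made-up quadratic cost bound M(L, n) ~ n^2.
    Isum, kopt = cost_tables(lambda n: n * n, N=100, r=10)
    print(kopt[100], Isum[100])

The RVE and RVErS loops then simply look up kopt[n] at each step, remove that many randomly chosen variables, and keep or undo the removal depending on whether the new hypothesis stays within the error tolerance.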
Comparison of RVE and RVErS

Results

Variable selection results using naïve Bayes and C4.5 algorithms (LED data set)

    Learning  Selection    Iters  Subset evals  Inputs  Error (%)    Time (sec)  Search cost
    Bayes     none             –             –   150    30.3 ± 3.0         0.09       275000
              r_max = 25     127           172    22.7  26.9 ± 3.9        19         39700000
              r_max = 75     293           293    17.4  26.0 ± 3.3        50        109000000
              r_max = 150    434           434    25.6  25.9 ± 2.6        86        202000000
              k = 1          423           423    23.7  27.0 ± 2.1        85        204000000
              forward         14          2006    13    26.6 ± 2.9       141        154000000
              backward        14          1870   138    30.1 ± 2.6       667       1950000000
              filter         150           150    23.7  27.1 ± 2.1        34         84900000
    C4.5      none             –             –   150    43.9 ± 4.5         0.5          54800
              r_max = 25      85            85    51.1  42.0 ± 3.0        89         10100000
              r_max = 75     468           468    25.8  42.5 ± 4.5       363         37000000
              r_max = 150    541           541    25.2  40.8 ± 5.7       440         44800000
              k = 1          510           510    32.4  42.5 ± 2.7       439           952000
              forward          9          1286     7.8  27.0 ± 3.2       196        133000000
              backward        61          7218    90.9  43.5 ± 3.5     11481        133000000
              filter         150           150     7.1  27.3 ± 3.5       156         16900000

Variable selection results using naïve Bayes and C4.5 algorithms (soybean data set)

    Learning  Selection    Iters  Subset evals  Inputs  Error (%)    Time (sec)  Search cost
    Bayes     none             –             –    35     7.8 ± 2.4         0.02        24000
              r_max = 15     142           142    12.6   8.9 ± 4.2         5.9       6470000
              r_max = 25     135           135    11.9  10.5 ± 5.8         5.8       6260000
              r_max = 35     132           132    11.2   9.8 ± 5.1         5.8       6170000
              k = 1           88            88    12.3   9.6 ± 5.0         4.6       4670000
              forward         13           382    12.3   7.3 ± 2.9         8.9      10900000
              backward        19           472    18     7.9 ± 4.6        37.5      36300000
              filter          35            35    31.3   7.8 ± 2.6         2         1970000
    C4.5      none             –             –    35     8.6 ± 4.0         0.04         1210
              r_max = 15     118           118    16.3   9.5 ± 4.6        13.5        278000
              r_max = 25     158           158    14.7  10.1 ± 4.1        18.6        190000
              r_max = 35     139           139    16.3   9.1 ± 3.7        17.3        386000
              k = 1          117           117    16.1   9.3 ± 3.5        14.9        352000
              forward         16           435    14.8   9.1 ± 4.0        33.7        322000
              backward        18           455    19.1  10.4 ± 4.4        69         1750000
              filter          35            35    30.8   8.5 ± 3.7         3.7         60600

My implementation
• Integrate with Weka.
• Extend the NaiveBayes and J48 algorithms.
• Obtain results for some of the UCI data sets used by the authors.
• Compare the results with those reported by the authors.
• Work in progress.

RECAP

Questions

References
• H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the Ninth National Conference on Artificial Intelligence, Anaheim, CA, 1991. MIT Press.
• R. Caruana and D. Freitag. Greedy attribute selection. In Machine Learning: Proceedings of the Eleventh International Conference, New Brunswick, NJ, 1994. Morgan Kaufmann.
• K. Kira and L. Rendell. A practical approach to feature selection. In D. Sleeman and P. Edwards, editors, Machine Learning: Proceedings of the Ninth International Conference, San Mateo, CA, 1992. Morgan Kaufmann.

References (contd…)
• G. H. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In Machine Learning: Proceedings of the Eleventh International Conference, pages 121–129, New Brunswick, NJ, 1994. Morgan Kaufmann.
• P. A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice Hall International, 1982.