Randomized Variable Elimination
David J. Stracuzzi
Paul E. Utgoff
Agenda
• Background
• Filter and wrapper methods
• Randomized Variable Elimination
• Cost Function
• RVE algorithm when r is known (RVE)
• RVE algorithm when r is not known (RVErS)
• Results
• Questions
Variable Selection Problem
• Choosing the relevant attributes from a set of attributes.
• Producing the subset of a large set of input variables that best predicts the target function.
• A forward selection algorithm starts with an empty set and searches for variables to add.
• A backward selection algorithm starts with the entire set of variables and removes irrelevant variables.
• In some cases, a forward selection algorithm also removes variables in order to recover from earlier poor selections.
• Caruana and Freitag (1994) experimented with greedy search methods and found that allowing the search to both add and remove variables outperforms simple forward and backward searches.
• Two families of methods for variable selection: filter and wrapper methods.
Filter methods
• Use statistical measures to evaluate the quality of variable subsets.
• Subsets of variables are evaluated with respect to a specific quality measure.
• Statistical evaluation of variables requires very little computational cost compared to running the learning algorithm.
• FOCUS (Almuallim and Dietterich, 1991) searches for the smallest subset that completely discriminates between the target classes.
• Relief (Kira and Rendell, 1992) ranks variables according to a distance-based relevance measure.
• In filter methods, variables are evaluated independently and not in the context of the learning problem.
Wrapper methods
• Use the performance of the learning algorithm to evaluate the quality of candidate subsets of input variables.
• The learning algorithm is executed on the candidate variable set and the accuracy of the resulting hypothesis is then tested (see the sketch below).
• Advantage: since wrapper methods evaluate variables in the context of the learning problem, they outperform filter methods.
• Disadvantage: the cost of repeatedly executing the learning algorithm can become problematic.
• John, Kohavi, and Pfleger (1994) coined the term “wrapper,” but the technique was used earlier (Devijver and Kittler, 1982).
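A rough sketch of the wrapper idea in Python (not code from the paper): each candidate subset is scored by retraining the learner on just those variables and measuring accuracy on held-out data. The names train, accuracy, train_data, and holdout_data are hypothetical stand-ins for whatever learning algorithm and evaluation scheme are being wrapped.

    def wrapper_best_subset(candidates, train, accuracy, train_data, holdout_data):
        """Return the candidate variable subset whose retrained hypothesis scores best."""
        def score(subset):
            hypothesis = train(train_data, subset)             # execute the learning algorithm on the subset
            return accuracy(hypothesis, holdout_data, subset)  # test the resulting hypothesis
        return max(candidates, key=score)

The cost issue is visible here: every candidate subset triggers a full run of the learning algorithm.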
Randomized Variable Elimination
• Falls under the category of wrapper methods.
• First, a hypothesis is produced for the entire set of n variables.
• A subset is formed by randomly selecting k variables.
• A hypothesis is then produced for the remaining (n-k) variables.
• The accuracies of the two hypotheses are compared.
• Removal of any relevant variable should cause an immediate decline in performance.
• Uses a cost function to balance the expected cost of successive failures against the cost of running the learning algorithm many times.
The Cost Function
Probability of selecting k variables
• The probability of successfully selecting k irrelevant variables at random is given by

    p^{-}(n, r, k) = \prod_{i=0}^{k-1} \frac{n - r - i}{n - i}

  where n is the number of remaining variables and r is the number of relevant variables.
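As a quick check of the formula, a small Python sketch of p⁻(n, r, k) (not the authors' code):

    def p_minus(n, r, k):
        """Probability that k variables drawn at random (without replacement) from
        n remaining variables are all irrelevant, given that r of them are relevant."""
        prob = 1.0
        for i in range(k):
            prob *= (n - r - i) / (n - i)
        return prob

For example, p_minus(100, 10, 5) ≈ 0.58: with 10 relevant variables among 100, a random draw of 5 variables avoids all of them about 58% of the time.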
Expected number of failures
• The expected number of consecutive failures before a success at selecting k irrelevant variables is given by

    E^{-}(n, r, k) = \frac{1 - p^{-}(n, r, k)}{p^{-}(n, r, k)}

• This is the expected number of consecutive trials in which at least one of the r relevant variables is randomly selected along with irrelevant variables.
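Since each selection attempt succeeds independently with probability p⁻(n, r, k), E⁻ is simply the mean number of failures of a geometric process. A one-line sketch, reusing p_minus from the sketch above:

    def expected_failures(p_success):
        """Expected number of consecutive failures before the first success."""
        return (1.0 - p_success) / p_success

Continuing the example, expected_failures(p_minus(100, 10, 5)) ≈ 0.71, so on average fewer than one failed attempt precedes a successful removal of 5 variables.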
Cost of removing k variables
• The expected cost of successfully removing k variables from n remaining, given r relevant variables, is

    I(n, r, k) = E^{-}(n, r, k) \, M(L, n - k) + M(L, n - k) = M(L, n - k) \left( E^{-}(n, r, k) + 1 \right)

  where M(L, n) represents an upper bound on the cost of running algorithm L on n inputs.
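A sketch of I(n, r, k) that reuses p_minus and expected_failures from the earlier sketches. The argument m_bound(n) is a hypothetical stand-in for the algorithm-specific bound M(L, n), which the caller must supply; for a learner whose training cost grows roughly linearly with the number of inputs, something like m_bound = lambda n: n could serve as a crude model.

    def removal_cost(n, r, k, m_bound):
        """I(n, r, k) = M(L, n - k) * (E^-(n, r, k) + 1)."""
        e_fail = expected_failures(p_minus(n, r, k))
        return m_bound(n - k) * (e_fail + 1.0)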
Optimal cost of removing irrelevant variables
• The optimal cost of removing the irrelevant variables from n remaining, r of which are relevant, is given by

    I_{sum}(n, r) = \min_{k} \left( I(n, r, k) + I_{sum}(n - k, r) \right)
Optimal value for k
• The optimal value k_{opt}(n, r) is computed as

    k_{opt}(n, r) = \arg\min_{k} \left( I(n, r, k) + I_{sum}(n - k, r) \right)

• It is the value of k for which the expected cost of removing variables is optimal.
Algorithms
Algorithm for computing k and cost values
• Given: L, N, r
    Isum[r+1…N] ← 0
    kopt[r+1…N] ← 0
    for i ← r+1 to N do
      bestCost ← ∞
      for k ← 1 to i-r do
        temp ← I(i, r, k) + Isum[i-k]
        if (temp < bestCost) then
          bestCost ← temp
          bestK ← k
      Isum[i] ← bestCost
      kopt[i] ← bestK
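A runnable Python version of the table computation above (a sketch, not the authors' implementation). It uses removal_cost and m_bound from the earlier sketches, and makes explicit that Isum[r] = 0, since nothing remains to be removed once only the r relevant variables are left.

    def compute_tables(N, r, removal_cost, m_bound):
        """Fill I_sum[i] and k_opt[i] for i = r+1 ... N by dynamic programming."""
        I_sum = {i: 0.0 for i in range(r, N + 1)}   # I_sum[r] = 0
        k_opt = {}
        for i in range(r + 1, N + 1):
            best_cost, best_k = float("inf"), 1
            for k in range(1, i - r + 1):
                temp = removal_cost(i, r, k, m_bound) + I_sum[i - k]
                if temp < best_cost:
                    best_cost, best_k = temp, k
            I_sum[i] = best_cost
            k_opt[i] = best_k
        return I_sum, k_opt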
Randomized Variable Elimination (RVE) when r is known
• Given: L, n, r, tolerance
• Compute tables for Isum(i, r) and kopt(i, r)
  h ← hypothesis produced by L on n inputs
• while n > r do
    k ← kopt(n, r)
    select k variables at random and remove them
    h’ ← hypothesis produced by L on n-k inputs
    if e(h’) – e(h) ≤ tolerance then
      n ← n-k
      h ← h’
    else
      replace the selected k variables
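A hedged Python sketch of the RVE loop for known r. Here learn(subset) and error(h) are hypothetical stand-ins for running L on a variable subset and estimating the error of the resulting hypothesis, and k_opt is the table computed above (mapping the number of remaining variables to the optimal block size).

    import random

    def rve(variables, r, learn, error, k_opt, tolerance):
        """Remove randomly chosen blocks of k_opt[n] variables until only r remain."""
        variables = list(variables)
        h = learn(variables)
        while len(variables) > r:
            k = k_opt[len(variables)]
            removed = random.sample(variables, k)        # select k variables at random
            candidate = [v for v in variables if v not in removed]
            h_new = learn(candidate)
            if error(h_new) - error(h) <= tolerance:     # no loss in accuracy: keep the removal
                variables, h = candidate, h_new
            # otherwise the selected variables are "replaced" by leaving `variables` unchanged
        return h, variables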
RVE example
• Plot of the expected cost of running RVE, Isum(N, r = 10), along with the cost of removing inputs individually and the estimated number of updates M(L, n).
• L is a learner that fits a Boolean function using a perceptron unit.
Randomized Variable Elimination including a search for r (RVErS)
• Given: L, c1, c2, n, rmax, rmin, tolerance
• Compute tables Isum(i, r) and kopt(i, r) for rmin ≤ r ≤ rmax
  r ← (rmax + rmin) / 2
  success, fail ← 0
  h ← hypothesis produced by L on n inputs
• repeat
    k ← kopt(n, r)
    select k variables at random and remove them
    h’ ← hypothesis produced by L on (n-k) inputs
    if e(h’) – e(h) ≤ tolerance then
      n ← n – k
      h ← h’
      success ← success + 1
      fail ← 0
    else
      replace the selected k variables
      fail ← fail + 1
      success ← 0
RVErS (contd…)
    if n ≤ rmin then
      r, rmax, rmin ← n
    else if fail ≥ c1E⁻(n,r,k) then
      rmin ← r
      r ← (rmax + rmin) / 2
      success, fail ← 0
    else if success ≥ c2(r – E⁻(n,r,k)) then
      rmax ← r
      r ← (rmax + rmin) / 2
      success, fail ← 0
  until rmin < rmax and fail ≤ c1E⁻(n,r,k)
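A hedged Python sketch of the RVErS loop, reusing the hypothetical learn/error interfaces and the p_minus/expected_failures sketches from earlier; k_opt[r] is assumed to map a number of remaining variables n to the block size for that guess of r. The stopping test is interpreted here as: quit once the bounds on r have converged and there is no longer evidence of removable variables; small guards are added so the sketch terminates.

    import random

    def rvers(variables, r_min, r_max, c1, c2, learn, error, k_opt, tolerance):
        """Binary-search for the number of relevant variables r while removing variables."""
        def e_minus(n, r, k):
            return expected_failures(p_minus(n, r, k))

        variables = list(variables)
        r = (r_max + r_min) // 2
        success = fail = 0
        h = learn(variables)
        while True:
            n = len(variables)
            k = k_opt[r].get(n, 1)                        # fall back to k = 1 when no table entry exists
            removed = random.sample(variables, k)
            candidate = [v for v in variables if v not in removed]
            h_new = learn(candidate)
            if error(h_new) - error(h) <= tolerance:
                variables, h = candidate, h_new           # keep the removal
                success, fail = success + 1, 0
            else:                                         # put the selected variables back
                fail, success = fail + 1, 0
            n = len(variables)
            if n <= r_min:                                # everything left looks relevant
                r = r_min = r_max = n
            elif r_min < r_max and fail >= c1 * e_minus(n, r, k):
                r_min = r                                 # too many failures: raise the estimate of r
                r = (r_max + r_min) // 2
                success = fail = 0
            elif r_min < r_max and success >= c2 * (r - e_minus(n, r, k)):
                r_max = r                                 # long run of successes: lower the estimate
                r = (r_max + r_min) // 2
                success = fail = 0
            if r_min >= r_max and (n <= r_min or fail >= c1 * e_minus(n, r, k)):
                return h, variables                       # r has converged and removals no longer succeed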
Comparison of RVE and RVErS
Results
Variable Selection results using naïve Bayes and C4.5 algorithms (LED data set)

Learning Algorithm | Selection Algorithm | Subset Evals | Iters | Inputs | Percent Error | Time (sec) | Search Cost
Bayes | none (all inputs) | - | - | 150 | 30.3 ± 3.0 | 0.09 | 275000
Bayes | r_max = 25 | 172 | 127 | 22.7 | 26.9 ± 3.9 | 19 | 39700000
Bayes | r_max = 75 | 293 | 293 | 17.4 | 26.0 ± 3.3 | 50 | 109000000
Bayes | r_max = 150 | 434 | 434 | 25.6 | 25.9 ± 2.6 | 86 | 202000000
Bayes | k = 1 | 423 | 423 | 23.7 | 27.0 ± 2.1 | 85 | 204000000
Bayes | forward | 2006 | 14 | 13 | 26.6 ± 2.9 | 141 | 154000000
Bayes | backward | 1870 | 14 | 138 | 30.1 ± 2.6 | 667 | 1950000000
Bayes | filter | 150 | 150 | 23.7 | 27.1 ± 2.1 | 34 | 84900000
C4.5 | none (all inputs) | - | - | 150 | 43.9 ± 4.5 | 0.5 | 54800
C4.5 | r_max = 25 | 85 | 85 | 51.1 | 42.0 ± 3.0 | 89 | 10100000
C4.5 | r_max = 75 | 468 | 468 | 25.8 | 42.5 ± 4.5 | 363 | 37000000
C4.5 | r_max = 150 | 541 | 541 | 25.2 | 40.8 ± 5.7 | 440 | 44800000
C4.5 | k = 1 | 510 | 510 | 32.4 | 42.5 ± 2.7 | 439 | 952000
C4.5 | forward | 1286 | 9 | 7.8 | 27.0 ± 3.2 | 196 | 133000000
C4.5 | backward | 7218 | 61 | 90.9 | 43.5 ± 3.5 | 11481 | 133000000
C4.5 | filter | 150 | 150 | 7.1 | 27.3 ± 3.5 | 156 | 16900000
Variable Selection results using naïve Bayes and C4.5 algorithms (soybean data set)

Learning Algorithm | Selection Algorithm | Subset Evals | Iters | Inputs | Percent Error | Time (sec) | Search Cost
Bayes | none (all inputs) | - | - | 35 | 7.8 ± 2.4 | 0.02 | 24000
Bayes | r_max = 15 | 142 | 142 | 12.6 | 8.9 ± 4.2 | 5.9 | 6470000
Bayes | r_max = 25 | 135 | 135 | 11.9 | 10.5 ± 5.8 | 5.8 | 6260000
Bayes | r_max = 35 | 132 | 132 | 11.2 | 9.8 ± 5.1 | 5.8 | 6170000
Bayes | k = 1 | 88 | 88 | 12.3 | 9.6 ± 5.0 | 4.6 | 4670000
Bayes | forward | 382 | 13 | 12.3 | 7.3 ± 2.9 | 8.9 | 10900000
Bayes | backward | 472 | 19 | 18 | 7.9 ± 4.6 | 37.5 | 36300000
Bayes | filter | 35 | 35 | 31.3 | 7.8 ± 2.6 | 2 | 1970000
C4.5 | none (all inputs) | - | - | 35 | 8.6 ± 4.0 | 0.04 | 1210
C4.5 | r_max = 15 | 118 | 118 | 16.3 | 9.5 ± 4.6 | 13.5 | 278000
C4.5 | r_max = 25 | 158 | 158 | 14.7 | 10.1 ± 4.1 | 18.6 | 190000
C4.5 | r_max = 35 | 139 | 139 | 16.3 | 9.1 ± 3.7 | 17.3 | 386000
C4.5 | k = 1 | 117 | 117 | 16.1 | 9.3 ± 3.5 | 14.9 | 352000
C4.5 | forward | 435 | 16 | 14.8 | 9.1 ± 4.0 | 33.7 | 322000
C4.5 | backward | 455 | 18 | 19.1 | 10.4 ± 4.4 | 69 | 1750000
C4.5 | filter | 35 | 35 | 30.8 | 8.5 ± 3.7 | 3.7 | 60600
My implementation
• Integrate with Weka
• Extend the NaiveBayes and J48 algorithms
• Obtain results for some of the UCI datasets used by the authors
• Compare results with those reported by the authors
• Work in progress
RECAP
Questions
References
• H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the Ninth National Conference on Artificial Intelligence, Anaheim, CA, 1991. MIT Press.
• R. Caruana and D. Freitag. Greedy attribute selection. In Machine Learning: Proceedings of the Eleventh International Conference, New Brunswick, NJ, 1994. Morgan Kaufmann.
• K. Kira and L. Rendell. A practical approach to feature selection. In D. Sleeman and P. Edwards, editors, Machine Learning: Proceedings of the Ninth International Conference, San Mateo, CA, 1992. Morgan Kaufmann.
References (contd…)
• G. H. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In Machine Learning: Proceedings of the Eleventh International Conference, pages 121-129, New Brunswick, NJ, 1994. Morgan Kaufmann.
• P. A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice Hall International, 1982.