NP-Completeness of Searches for Smallest Possible Feature Sets

Scott Davies and Stuart Russell
Computer Science Division
University of California
Berkeley, CA 94720
{sdavies,russell}@cs.berkeley.edu
Abstract
In many learning problems, the learning system is presented with values for features that are actually irrelevant to the concept it is trying to learn. The FOCUS algorithm, due to Almuallim and Dietterich [1], performs an explicit search for the smallest possible input feature set S that permits a consistent mapping from the features in S to the output feature. The FOCUS algorithm can also be seen as an algorithm for learning determinations or functional dependencies, as suggested in [6]. Another algorithm for learning determinations appears in [7]. The FOCUS algorithm has superpolynomial runtime, but Almuallim and Dietterich leave open the question of the tractability of the underlying problem. In this paper, the problem is shown to be NP-complete. We also briefly describe some experiments that demonstrate the benefits of determination learning, and show that finding lowest-cardinality determinations is easier in practice than finding minimal determinations.
Proof of NP-Completeness
Define the MIN-FEATURES problem as follows: given a set X of examples (each composed of a binary value specifying the value of the target feature and a vector of binary values specifying the values of the other features) and a number n, determine whether or not there exists some feature set S such that:

- S is a subset of the set of all input features;
- S has cardinality n;
- no two examples in X have identical values for all the features in S but different values for the target feature.
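As a concrete illustration, the decision problem can be spelled out by brute force as in the following Python sketch; an example is encoded as a tuple of binary input values paired with a binary target value, and the helper names consistent and min_features are ours.

```python
from itertools import combinations

# A sketch of the MIN-FEATURES decision problem defined above.
# An example is an (inputs, target) pair, where `inputs` is a tuple of 0/1 values.

def consistent(examples, S):
    """True iff no two examples agree on every feature in S but differ on the target."""
    seen = {}
    for inputs, target in examples:
        key = tuple(inputs[i] for i in S)
        if seen.setdefault(key, target) != target:
            return False
    return True

def min_features(examples, n):
    """Decide whether some feature set S of cardinality n permits a consistent
    mapping from the features in S to the target feature."""
    num_features = len(examples[0][0])
    return any(consistent(examples, S)
               for S in combinations(range(num_features), n))

# Tiny hypothetical illustration: the target below depends only on feature 0.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1)]
assert min_features(examples, 1)
```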
We show that MIN-FEATURES is NP-complete by reducing VERTEX-COVER to MIN-FEATURES.* The VERTEX-COVER problem may be stated as the question: given a graph G with vertices V and edges E, is there a subset V' of V, of size m, such that each edge in E is connected to at least one vertex in V'? We may reduce an instance of VERTEX-COVER to an instance of MIN-FEATURES by mapping each edge in E to an example in X, with one input feature for every vertex in V.

* In [8], a "proof" is reported for this result by reduction to set covering; since that reduction goes in the wrong direction, it fails to show NP-completeness.
Figure 1: An example graph for VERTEX-COVER (vertices labelled A through F).
Start with one example that has 0's for all input features (one input for every vertex in V) and a 0 for the output; call this the "null example." Then, for each edge E_i in E, add one example that has a value of 1 for the output and 0's for all input features except the two inputs corresponding to the two vertices to which E_i is connected; for these two inputs, set the input values to 1's. Call these examples the "edge-generated" examples. Finally, use the set-size parameter m from the VERTEX-COVER problem as the set-size parameter n for the MIN-FEATURES problem. Figure 1 shows a graph with labelled vertices, and Figure 2 shows the corresponding MIN-FEATURES problem.
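The construction can be sketched directly; reduce_vertex_cover is our name for a routine that builds the MIN-FEATURES examples from a graph, and the small graph at the bottom is a hypothetical illustration (the edge list of Figure 1 is not reproduced here).

```python
def reduce_vertex_cover(vertices, edges, m):
    """Build the MIN-FEATURES examples: one input feature per vertex,
    one edge-generated example per edge, plus the null example."""
    index = {v: i for i, v in enumerate(vertices)}
    null_example = (tuple(0 for _ in vertices), 0)       # all inputs 0, output 0
    examples = [null_example]
    for u, w in edges:                                    # edge-generated examples
        inputs = [0] * len(vertices)
        inputs[index[u]] = inputs[index[w]] = 1           # 1's at the edge's two endpoints
        examples.append((tuple(inputs), 1))               # output 1
    return examples, m                                    # the set-size parameter carries over

# Hypothetical graph, for illustration only.
V = ["A", "B", "C", "D"]
E = [("A", "B"), ("B", "C"), ("C", "D")]
examples, n = reduce_vertex_cover(V, E, 2)
```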
If there is an input set S of size n that satisfies the generated MIN-FEATURES problem, then there is a vertex cover V' of size m = n that solves the VERTEX-COVER problem: simply let V' contain exactly those vertices that correspond to the input features contained in S. This works because for any edge E_i in the graph, S must contain at least one of the two input features corresponding to the vertices to which E_i is connected, since this constraint must have been met in order to prevent the example generated by E_i from matching the "null example" on all the features in S.
Figure 2: The examples for the corresponding MIN-FEATURES problem. Each example gives values for the inputs A-F and the output X; the first row is the null example, and the remaining rows are the edge-generated examples, which follow the graph's adjacency matrix.

Conversely, if there is a vertex set V' of size m that solves an instance of the VERTEX-COVER problem, then
there is a solution S of size n = m to the generated MIN-FEATURES problem, namely the input set S that contains exactly those input features corresponding to the vertices in V'. Since V' is a solution, there must be at least one vertex in V' that is connected to any given edge in the graph G; this means that, for any "edge-generated" example in the MIN-FEATURES problem, at least one of the input features that have values of 1 must be in S. This prevents any "edge-generated" example from matching the "null example" on all inputs in S, and thus ensures that S is a solution to the MIN-FEATURES problem.
Since the generated MIN-FEATURES problem has a solution if and only if the original VERTEX-COVER problem has a solution, VERTEX-COVER has been reduced to MIN-FEATURES; the reduction described above is computable in polynomial time. Thus, MIN-FEATURES is NP-hard.
MIN-FEATURES is obviously in NP, since we can tell whether or not any given input feature set S is a solution in polynomial time by hashing: for each example X_i, use the values of the input features of X_i that are in S to produce a key for the table entry, and store the value of X_i's output feature as the data for the table entry. If two entries have the same key but different data, S is not a solution; otherwise, it is a solution.
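This check can be sketched as follows, with verify_feature_set standing in (under our own naming) for the hash-table procedure just described; its running time is polynomial in the number of examples and features, which is what the membership argument needs.

```python
def verify_feature_set(examples, S):
    table = {}
    for inputs, target in examples:                 # one pass over the examples X_i
        key = tuple(inputs[i] for i in sorted(S))   # key: values of X_i's features in S
        if key in table and table[key] != target:   # same key, different data
            return False                            # S is not a solution
        table[key] = target                         # data: X_i's output feature
    return True                                     # no conflicts: S is a solution
```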
Therefore, MIN-FEATURES is NP-complete. Not only does the problem of finding the smallest relevant feature set S appear intractable, but we cannot even determine the minimum size of such a set efficiently.
It is important to note that the proof that MIN-FEATURES is NP-complete relies on the generation of problem instances in which all features are relevant. In general, the search for the smallest possible set will take time non-polynomial (unless P = NP) in the number of input features; however, if it turns out that only a few inputs are actually relevant, our reduction does not say that any algorithm must take time that is non-polynomial in the total number of features. Hence, while the general problem is intractable, it may still be feasible to use a search along the lines we have been describing if we suspect that only a logarithmic fraction of the inputs are relevant: we can order the search from smaller sets to larger sets, and abort the search if we were mistaken in our suspicions.
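Such an ordered search might look like the following sketch, where max_size encodes our suspicion about how many inputs are relevant; the function name is ours.

```python
from itertools import combinations

def smallest_feature_set(examples, max_size):
    """Try candidate sets in order of increasing cardinality, up to max_size."""
    num_features = len(examples[0][0])
    for n in range(max_size + 1):                          # smaller sets before larger ones
        for S in combinations(range(num_features), n):
            table, ok = {}, True
            for inputs, target in examples:                # hash-table consistency test
                key = tuple(inputs[i] for i in S)
                if table.setdefault(key, target) != target:
                    ok = False
                    break
            if ok:
                return set(S)
    return None  # abort: we were mistaken in our suspicions
```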
The Importance of Relevance
Our algorithm for identifying and using relevant features is as follows. As the examples are processed, we maintain a list L of all the possible feature sets S, of the lowest possible cardinality n, such that any examples seen so far that match on the features in S have identical output feature values. Every time we encounter a new example, we check the validity of each set S in L (by using a hash table as described above), throwing out those sets that are no longer valid. If the list consequently becomes empty, we increase n by 1 and add to the list all feature sets of the new cardinality that are now valid.
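A minimal sketch of this bookkeeping follows; the class and method names are ours, and the enumeration of candidate sets is deliberately naive.

```python
from itertools import combinations

class DeterminationTracker:
    """Maintain L, the list of all valid feature sets of the current lowest cardinality n."""

    def __init__(self, num_features):
        self.num_features = num_features
        self.examples = []
        self.n = 0
        self.L = [frozenset()]          # with no examples yet, the empty set is valid

    def _valid(self, S):
        # Hash-table consistency check over all examples seen so far.
        table = {}
        for inputs, target in self.examples:
            key = tuple(inputs[i] for i in sorted(S))
            if table.setdefault(key, target) != target:
                return False
        return True

    def add_example(self, inputs, target):
        self.examples.append((inputs, target))
        # Re-check every set in L, throwing out those that are no longer valid.
        self.L = [S for S in self.L if self._valid(S)]
        # If L became empty, increase n and collect all valid sets of the new cardinality.
        while not self.L and self.n < self.num_features:
            self.n += 1
            self.L = [frozenset(S)
                      for S in combinations(range(self.num_features), self.n)
                      if self._valid(S)]
```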
When asked to predict an output for a given input vector, an algorithm M maintaining the list L may pick a random set S from L, pass on the features in S to any other machine learning algorithm M', and return the value returned by M'. Alternatively, it may call M' once for every set S in L and return the value predicted most often, or return a distribution based on the values predicted. This takes more time, however, and although the resulting learning curves are more predictable, they often have the same averages.
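The first of these prediction strategies might be sketched as follows, assuming a generic learner with a fit/predict interface standing in for M'; all names here are ours.

```python
import random

def predict_with_random_determination(L, train_examples, query, make_learner):
    S = sorted(random.choice(L))                                   # pick a random set S from L
    X = [[inputs[i] for i in S] for inputs, _ in train_examples]   # keep only the features in S
    y = [target for _, target in train_examples]
    learner = make_learner()                                       # e.g. a decision-tree learner as M'
    learner.fit(X, y)                                              # train M' on the projected data
    return learner.predict([[query[i] for i in S]])[0]             # assumes predict returns a sequence
```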
Using a basic implementation of Quinlan's ID3 decision-tree learning algorithm [5] for M', we tested the above algorithm on random boolean functions of 16 boolean variables, only 5 of which were relevant. Three trials were performed, each with a different random function; for each trial, up to 150 of the 3200 examples generated were used for training, while the rest were used for testing. Figure 3 shows the resulting learning curve, and contrasts it with the learning curve that was generated when ID3 was used directly on all of the feature values of the same three data sets. It is clear that the system learned much more quickly when we used a FOCUS-like algorithm to filter out features which could conceivably be irrelevant.
Figure 3: Learning curves for ID3 with feature set selection versus raw ID3 (sixteen features, five relevant); percent correct on the test set plotted against training set size.

Some papers ([7], for example) have put forward algorithms for determining all feature sets S that are "minimal" in the sense that any proper subset of S would be insufficient to permit a consistent mapping from the features in the proper subset to the output feature. The list
L maintained by the algorithm above contains a subset of these "minimal" sets, i.e., the ones of the lowest possible cardinality. Empirically, this list L is often much smaller than the list of all "minimal" feature sets, and is therefore much quicker to update when a new example is presented to the system. Figure 4 shows the average number of "minimal" sets that existed during the generation of the learning curve in Figure 3, and contrasts this with the much smaller number of sets that were actually of the lowest possible cardinality. Regardless of the method by which the minimal sets were used to form a feature set for ID3, we found that the inductive performance of the two approaches was more or less the same.

Figure 4: Number of "minimal" sets vs. number of lowest-cardinality sets, plotted against training set size.
It is interesting to note that the minimum-cardinality
approach settled on a single determination in all three trials after about 30 examples: the same point at which the
learning curve shows a dramatic improvement. This suggests strongly that learning the determination, in addition
to the decision tree itself, is an important part of inductive
performance.
Conclusion
FOCUS-type algorithms are fairly simple to implement, and can significantly improve the performance of machine learning programs faced with certain types of problems. In general, however, their approach to searching for relevant features appears to be intractable. Furthermore, they must be significantly changed if they are to work properly for noisy or floating-point functions. John et al. [4] suggest that it may be better to select a feature set that yields the best inductive performance by M' (as measured by cross-validation), rather than simply identifying an exactly relevant feature set. This approach seems promising, and more likely to tolerate noise.
The search for smallest consistent feature sets becomes tractable if we are allowed to make certain assumptions about the domains in which the machine learning system is working. If the learning system is allowed to make queries about the outputs corresponding to arbitrary input vectors, then the determination of smallest possible feature sets for discrete data becomes completely tractable [3]. If it is known that the function being learned is of a certain class (e.g., K-CNF or K-DNF), then the problem once again becomes tractable [2].

References

[1] Almuallim, H., and Dietterich, T. (1991) "Learning with Many Irrelevant Features." In Proc. AAAI-91.

[2] Blum, A. (1990) "Learning Boolean Functions in an Infinite Attribute Space." In Proceedings of the Twenty-Second Annual ACM Symposium on Theory of Computing, pp. 64-72. Baltimore, MD.
[3] Blum, A., Hellerstein, L., and Littlestone, N. (1991) "Learning in the Presence of Finitely or Infinitely Many Irrelevant Attributes." In Proceedings of the Fourth Annual Workshop on Computational Learning Theory. Santa Cruz, CA: Morgan Kaufmann.

[4] John, G. H., Kohavi, R., and Pfleger, K. (1994) "Irrelevant features and the subset selection problem." In Proceedings of ML-94. New Brunswick, NJ: Morgan Kaufmann.

[5] Quinlan, R. (1986) "Induction of Decision Trees." Machine Learning, 1:1. Kluwer Academic.

[6] Russell, S. J. (1989) The Use of Knowledge in Analogy and Induction. London: Pitman Press.

[7] Schlimmer, J. (1993) "Efficiently inducing determinations: A complete and systematic search algorithm that uses optimal pruning." In Proc. ML-93.

[8] Wong, S. K. M., and Ziarko, W. (1985) "On optimal decision rules in decision tables." Bull. Polish Acad. Sci. (Math.), 33(11-12), 693-696.