Notes on Learning with Irrelevant Attributes

From: AAAI Technical Report FS-94-02. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.
Notes on Learning with Irrelevant Attributes in the PAC Model

Aditi Dhagat
Department of EECS
University of Wisconsin
P.O. Box 784
Milwaukee, WI 53201
[email protected]

Lisa Hellerstein*
Department of EECS
Northwestern University
2145 Sheridan Road
Evanston, IL 60208-3118
[email protected]

Introduction
In these notes, we sketch some of our work on learning with irrelevant
attributes in Valiant’s PAC model [V84]. In the PAC model, the goal of
the learner is to produce an approximately correct hypothesis from
random sample data. If the number of relevant attributes in the target
function is small, it may be desirable to produce a hypothesis that also
depends on only a small number of variables. Our work is theoretical,
but has real-life analogues. For example, suppose we are trying to
determine which combinations of symptoms indicate that a person will
develop a certain disease. If only a few symptoms are relevant, we would
like our learning algorithm to produce a hypothesis that depends on only
a few symptoms. If we use such a hypothesis to predict whether future
patients will develop the disease, we will only have to test for a few
symptoms (and Hillary Rodham Clinton will be happy).
In the PAC learning model, the object of the learner is to find (with
"high" probability) a "good" approximation to a hidden target concept c.
The learner is given a sample consisting of labeled examples (a, l) of
the target concept (i.e., Boolean function) c. Here a is an element of
the domain of c, and l = c(a). The labeled examples are drawn
independently according to a fixed but unknown input distribution on the
domain.
*Supported in part by NSF grant CCR-92-10957.

The concepts we consider are expressible by functions of Boolean
attributes (variables). The domain elements a are Boolean assignments to
variables $\{x_1, \ldots, x_n\}$.
Suppose c is a Boolean function on $\{x_1, \ldots, x_n\}$. We say that a
variable $x_i$ is irrelevant to c if, given any assignment a, the value
of c(a) is independent of the value a gives to $x_i$. More precisely,
$x_i$ is irrelevant if the following holds: given any assignment a
setting $x_i$ to 1, if a' is obtained from a by changing the value of
$x_i$ to 0, then c(a) = c(a'). Note that if c is expressed by a Boolean
formula in which $x_i$ does not appear, then $x_i$ is irrelevant to c.
If a variable $x_i$ is not irrelevant, then it is relevant.
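For small n, the definition of an irrelevant variable can be checked directly by enumerating assignments. The following is a minimal sketch, assuming the concept is supplied as a Python function on tuples of 0/1 values; the names `is_irrelevant` and `relevant_variables` and the 0-based variable indices are illustrative choices, not part of the original notes.

```python
from itertools import product

def is_irrelevant(c, n, i):
    """Check the definition: flipping variable i from 1 to 0 never changes c."""
    for a in product((0, 1), repeat=n):
        if a[i] == 1:
            a_prime = a[:i] + (0,) + a[i + 1:]  # a' differs from a only at position i
            if c(a) != c(a_prime):
                return False
    return True

def relevant_variables(c, n):
    """Indices of the variables that are not irrelevant to c."""
    return [i for i in range(n) if not is_irrelevant(c, n, i)]

# Hypothetical concept on four variables: x_0 AND (NOT x_2).
c = lambda a: int(a[0] == 1 and a[2] == 0)
print(relevant_variables(c, 4))  # prints [0, 2]
```

The enumeration takes time exponential in n, so it only illustrates the definition; a learner in the PAC model sees random labeled examples and cannot query c in this way.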
Haussler [H88] addressed the problem of learning monomials in Valiant’s
PAC learning model [V84]. A monomial is a conjunction
$z_1 \wedge z_2 \wedge \cdots \wedge z_k$ where each $z_i$ is either a
variable $x_j$ or the negation of a variable $\bar{x}_j$. In particular,
Haussler considered the problem of learning monomials in which only a
small subset of $\{x_1, \ldots, x_n\}$ appear, and hence many of the
variables are irrelevant. He developed an algorithm for this problem
that is a simple application of the standard greedy set cover
approximation algorithm. Given any sample of a monomial on r variables,
Haussler’s algorithm finds a consistent monomial on at most
$r(\ln q + 1)$ variables, where q is the number of negative examples in
the sample (a negative example is one in which l = 0).
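The reduction to set cover can be sketched as follows. This is a minimal illustration under stated assumptions, not Haussler’s own presentation: examples are tuples of 0/1 values, a literal is a pair `(index, positive)`, only literals satisfied by every positive example are allowed, and each allowed literal "covers" the negative examples it sets to 0. The function names are hypothetical.

```python
from itertools import product

def lit_value(lit, a):
    """Value of literal (index, positive) under assignment a."""
    i, positive = lit
    return a[i] if positive else 1 - a[i]

def greedy_monomial(P, N, n):
    """Return a list of literals consistent with P and N, or None if none is found."""
    # Literals that every positive example sets to 1.
    usable = [(i, s) for i in range(n) for s in (True, False)
              if all(lit_value((i, s), a) == 1 for a in P)]
    uncovered = set(range(len(N)))
    chosen = []
    while uncovered:
        # For each usable literal, the uncovered negatives it sets to 0.
        gains = {l: {j for j in uncovered if lit_value(l, N[j]) == 0}
                 for l in usable}
        best = max(gains, key=lambda l: len(gains[l]), default=None)
        if best is None or not gains[best]:
            return None  # some negative example cannot be excluded
        chosen.append(best)
        uncovered -= gains[best]
    return chosen

def satisfies(monomial, a):
    return all(lit_value(l, a) == 1 for l in monomial)

# Hypothetical target x_0 AND (NOT x_2) on four variables, full truth table as sample.
target = lambda a: int(a[0] == 1 and a[2] == 0)
sample = list(product((0, 1), repeat=4))
P = [a for a in sample if target(a) == 1]
N = [a for a in sample if target(a) == 0]
h = greedy_monomial(P, N, 4)
print(h)
```

Every literal of the target monomial is usable, and together they exclude all negative examples, which is why the greedy guarantee for set cover carries over to the size of the output monomial.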
Good PAC learning algorithms do not necessarily require perfect
consistency with the sample. A variation of Haussler’s algorithm,
suggested by Warmuth, outputs a monomial on $O(r \log(1/\epsilon))$
variables that may be inconsistent with a small fraction of the sample,
of size $O(\epsilon)$ (see [KV94] pp. 41-42). The sample complexity of
the algorithm is
$O\left(\frac{1}{\epsilon}\left(\log\frac{1}{\delta} + r \log n \log\frac{1}{\epsilon}\right)\right)$.
That is, there is a number s of this order such that given a sample of
size s, the algorithm, with high probability, will output a good
approximation to the hidden monomial. More precisely, it will output a
monomial that, with probability at least $1 - \delta$ (i.e., "high"
probability), misclassifies a fraction of size at most $\epsilon$ of the
input distribution (i.e., is a "good" approximation). The sample
complexity of this algorithm is better than the sample complexity of
Haussler’s original algorithm. Note that the size of the monomial output
by the algorithm is only a factor $\log(1/\epsilon)$ larger than the
hidden monomial. It is independent of n, the total number of relevant
and irrelevant variables. The sample complexity is linear in r and
depends only logarithmically on n.
We extend the above results by presenting a PAC algorithm for learning
k-term DNF formulas on a small number of variables. A k-term DNF formula
is a disjunction of at most k terms (monomials). Given a sample of a
hidden k-term DNF formula on r variables, our algorithm outputs a
consistent DNF formula on only $O(r^k \log^k q)$ variables, where q is
the number of negative examples in the sample. Like Haussler’s
algorithm, our algorithm can be modified so that it outputs a hypothesis
that is inconsistent with at most an $O(\epsilon)$ fraction of the
sample. The modified algorithm is the basis for the following theorem.
Theorem 1 There is an algorithm that PAC learns the class of k-term DNF
formulas with at most r relevant variables from $\{x_1, \ldots, x_n\}$,
and that outputs a DNF formula with
• O(r ...) terms, over
• $O(r^k \log^k(r/\epsilon))$ relevant variables.
The sample complexity of the algorithm is bounded by
$$s = O\left(\frac{1}{\epsilon}\left(\log\frac{1}{\delta} + r^k \log n \log^k\frac{r}{\epsilon}\right)\right)$$
and it runs in time bounded by the polynomial $t = O(s n^k)$.
We consider k to be a constant in giving our bounds. Note that in all
the expressions in the above theorem, the dependence on r is polynomial,
but the dependence on n is logarithmic. We sketch our algorithm below.
The class of k-term DNF formulas on r relevant variables could be PAC
learned by (an algorithm outputting) k-CNF formulas on
$O(r^k \log(1/\epsilon))$ variables, using a simple variation of
Haussler’s monomial algorithm. However, our algorithm has the advantage
that it outputs a DNF formula.
PAC learning k-term DNF by k-term DNF (or even by 2k-term DNF) is
NP-hard [PV88]. Blum and Singh [BS90] showed that it is possible to
learn k-term DNF by DNF, but the hypothesis they output has
$O(n^{k-1})$ terms and $O(n^k)$ size. Whether k-term DNF can be learned
by DNF with $o(n^{k-1})$ terms remains open.¹ It is not even known
whether a 2-term DNF can be learned by $o(n)$-term DNF. Our theorem
shows that if the target k-term DNF concept is known to have r relevant
variables, then it is possible to PAC learn k-term DNF by a DNF which
has a number of terms that depends polynomially on r, but only
poly-logarithmically on n.
In a longer paper on our work, we also present an algorithm for PAC
learning Boolean decision lists with k alternations. A Boolean decision
list has the form
    if $l_1$ then return $b_1$
    else if $l_2$ then return $b_2$
    ...
    else if $l_m$ then return $b_m$
    else return $b_{m+1}$
where each $l_i$ is either a variable $x_j$ or its negation $\bar{x}_j$,
and each $b_i$ is either 0 or 1. A decision list has k alternations if
the sequence $b_1, \ldots, b_m$ changes from 0 to 1 or from 1 to 0 a
total of k times (note: this is not equivalent to the k-decision list
defined in [R87]).
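To make the definition concrete, here is a small sketch of evaluating a decision list and counting its alternations. The representation is an assumption made for illustration: a list of `((index, positive), output)` pairs plus a separate default output for $b_{m+1}$. Following the definition above, alternations are counted over $b_1, \ldots, b_m$ only.

```python
def evaluate(decision_list, default, a):
    """Return the output of the first rule whose literal is satisfied by a."""
    for (i, positive), b in decision_list:
        value = a[i] if positive else 1 - a[i]
        if value == 1:
            return b
    return default  # plays the role of b_{m+1}

def alternations(decision_list):
    """Number of changes between consecutive outputs b_1, ..., b_m."""
    outputs = [b for _, b in decision_list]
    return sum(1 for u, v in zip(outputs, outputs[1:]) if u != v)

# Hypothetical list: if x_0 return 1, else if NOT x_1 return 1,
# else if x_2 return 0, else if x_3 return 1, else return 0.
dl = [((0, True), 1), ((1, False), 1), ((2, True), 0), ((3, True), 1)]
print(alternations(dl))               # prints 2
print(evaluate(dl, 0, (0, 1, 1, 0)))  # prints 0
```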
Given a sample of a decision list with k alternations, containing r
relevant variables, our algorithm outputs a consistent decision list
with k alternations containing $O(r^k \log^k m)$ variables, where m is
the size of the sample. Again, a modified version of the algorithm
produces an output which is not quite consistent. The modified algorithm
forms the basis for the following theorem:
Theorem 2 There is an algorithm that PAC learns the class of
k-alternation decision lists over variables $\{x_1, \ldots, x_n\}$ with
r relevant variables by the class of k-alternation decision lists over
• $O(r^k \log^k(r/\epsilon))$ variables.
The sample complexity of the algorithm is bounded by the polynomial
$$s = O\left(\frac{1}{\epsilon}\left(\log\frac{1}{\delta} + r^k \log n \log^k\frac{r}{\epsilon}\right)\right)$$
and it runs in time bounded by the polynomial $t = O(s n^2)$.

¹ Learning k-term DNF by small DNF formulas is closely related to the
problem of graph colorability for k-colorable graphs [PV88]. It is known
that, unless P=NP, graph colorability cannot be approximated with ratio
$n^\epsilon$ (for a particular $\epsilon > 0$) [LY93]. However, it is
not known whether this holds for k-colorable graphs.

An algorithm to PAC learn k-term DNF in the presence of irrelevant variables
We describe our algorithm for PAC learning k-term DNF in the presence of
irrelevant variables. The algorithm takes as input sets P and N of
positive and negative examples of a k-term DNF. The algorithm assumes
that it knows r, the number of relevant variables in the hidden k-term
DNF (if this isn’t the case, the algorithm can be repeated with
decreasing values of r). It outputs a DNF formula that is consistent
with P and N.
Our algorithm uses as a subroutine a well-known greedy approximation
algorithm for the set cover problem [J74; L75]. The set cover problem
takes as input a collection S of subsets of a finite set U, and asks for
a subcollection S' of S such that the union of the subsets in S' is U,
and S' contains the smallest possible number of subsets. The problem is
NP-complete [K72]. The greedy approximation algorithm constructs a cover
of size at most $z(\ln|U| + 1)$, where z is the size of the smallest
cover.
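The greedy rule is to repeatedly pick the subset that covers the most elements not yet covered. A minimal sketch, assuming the collection is given as a dictionary from hypothetical subset names to Python sets:

```python
def greedy_set_cover(universe, subsets):
    """Return a list of subset names whose union is the universe, or None."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        # Pick the subset covering the most still-uncovered elements.
        name = max(subsets, key=lambda s: len(subsets[s] & uncovered), default=None)
        if name is None or not (subsets[name] & uncovered):
            return None  # the subsets do not cover the universe
        cover.append(name)
        uncovered -= subsets[name]
    return cover

# Hypothetical instance.
U = {1, 2, 3, 4, 5, 6}
S = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {5, 6}, "D": {1, 6}}
cover = greedy_set_cover(U, S)
print(cover)  # prints ['A', 'C']
```

On this instance the greedy cover happens to be optimal; in general it is only guaranteed to be within the logarithmic factor stated above.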
To introduce the algorithm, we first discuss the easier problem of
learning 2-term DNF formulas. Consider the problem of learning a 2-term
DNF formula $f = t_1 \vee t_2$ containing r relevant variables. Let P
and N be the given sets of positive and negative examples. Let l be a
literal that appears in one of the terms, say $t_2$. Consider any
positive example that sets l to 0. That example must satisfy $t_1$. Thus
if $P_l$ is the set of positive examples setting l to 0, then $P_l$ and
N are consistent with the formula $f' = t_1$. Moreover, if we run the
approximation algorithm for the minimum consistent monomial problem on
$P_l$ and N, then the output will be a monomial $f_l$ of size at most
$r(\ln q + 1)$, where $q = |N|$.
For every literal l, we form $P_l$ and run the approximation algorithm
for the minimum consistent monomial problem on $P_l$ and N. If that
algorithm outputs a consistent monomial $f_l$ of size at most
$r(\ln q + 1)$, then l is designated a candidate (for inclusion in f).
Note that every literal that is actually in f will be designated a
candidate.
We would like to find one of the terms in f, but as we cannot, we
instead generate a pseudo-term of f, as follows. Note that $t_1$ has the
property that every negative example sets at least one literal in $t_1$
to 0. Thus there exists a set of at most r candidates such that each
negative example in N sets at least one of those r literals to 0. Using
greedy set cover, we can find a set L' of at most $r(\ln q + 1)$
candidates such that each negative example in N sets at least one
literal in L' to 0. (More specifically, we run greedy set cover with U
equal to N and S equal to the set of subsets $S_l$, one for each
candidate l, such that
$S_l = \{x \in N \mid x \text{ sets } l \text{ to } 0\}$.)
We form a term M from the literals in L'. We then form a hypothesis g
which is the disjunction of M and all $f_l$ such that l is in M. There
is no guarantee that M is a term of f, nor that it even includes any
literals in f. Nevertheless, we can show that g is consistent with P and
N. By the construction of $f_l$, no negative example in N satisfies an
$f_l$. Moreover, since each negative example sets at least one literal
in M to 0, no negative example satisfies M. Thus g is consistent with N.
Consider any positive example a in P. If a sets a literal l in M to 0,
then a is in $P_l$, so a satisfies $f_l$ and hence g. Otherwise, a sets
all literals l in M to 1, and hence a satisfies M and therefore g. Thus
g is consistent with P as well. The number of literals in g is easily
seen to be $O(r^2 \log^2 q)$.
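The count behind this bound can be spelled out. The following is a reading of the construction above rather than a derivation given in the notes: M contains at most $r(\ln q + 1)$ literals, g contains one monomial $f_l$ for each literal l in M, and each $f_l$ contains at most $r(\ln q + 1)$ literals.

```latex
\[
\underbrace{r(\ln q + 1)}_{\text{literals in } M}
\;+\;
\underbrace{r(\ln q + 1)}_{\text{monomials } f_l}
\cdot
\underbrace{r(\ln q + 1)}_{\text{literals per } f_l}
\;=\; O(r^2 \log^2 q).
\]
```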
Our algorithm for learning k-term DNF for k > 2 is recursive, and based
on the technique just described. The algorithm takes as input k and the
sets P and N. Let $f = t_1 \vee \cdots \vee t_k$ be the target concept,
which has r relevant variables. The base case, k = 1, consists of using
greedy set cover to try to find a monomial of size at most
$r(\ln q + 1)$ consistent with P and N. If no such monomial is found,
the algorithm returns "Fail". For k > 1, the algorithm takes each
literal l, and forms the set $P_l$ of positive examples in P setting l
to 0. It recursively runs the algorithm on inputs $P_l$, N, and k - 1.
If the recursive call succeeds and returns a formula $f_l$ (which it
will if l actually appears in a term), then l is designated a candidate.
After determining all the candidates, the algorithm uses greedy set
cover to find a set L' of candidates of size at most $r(\ln q + 1)$ such
that each negative example sets at least one of the literals in the set
to 0. If no such set is found, the algorithm returns "Fail". Otherwise,
it forms a hypothesis g consisting of the disjunction of M (the term
formed from the literals in L') and the DNF formulas $f_l$, for all l in
M.
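The recursion just described can be sketched in a few dozen lines. This is an illustrative reading of the prose, not the pseudocode from the longer paper: the data representation (tuples of 0/1 values, literals as `(index, positive)` pairs, a DNF as a list of terms) and all function names are assumptions, and `None` stands for "Fail".

```python
import math
from itertools import product

def lit_value(lit, a):
    """Value of literal (index, positive) under assignment a."""
    i, positive = lit
    return a[i] if positive else 1 - a[i]

def greedy_cover(N, candidates, limit):
    """Greedily pick candidates until each example in N sets a chosen literal to 0."""
    uncovered = set(range(len(N)))
    chosen = []
    while uncovered:
        gains = {l: {j for j in uncovered if lit_value(l, N[j]) == 0}
                 for l in candidates}
        best = max(gains, key=lambda l: len(gains[l]), default=None)
        if best is None or not gains[best]:
            return None  # some negative example cannot be covered
        chosen.append(best)
        uncovered -= gains[best]
    return chosen if len(chosen) <= limit else None

def learn_dnf(P, N, n, k, r):
    """Return a DNF (list of terms) consistent with P and N, or None for "Fail"."""
    limit = r * (math.log(len(N)) + 1) if N else 0  # the bound r(ln q + 1)
    literals = [(i, s) for i in range(n) for s in (True, False)]
    if k == 1:
        # Base case: use only literals that every positive example satisfies.
        usable = [l for l in literals if all(lit_value(l, a) == 1 for a in P)]
        term = greedy_cover(N, usable, limit)
        return None if term is None else [term]
    sub = {}  # candidate literal l -> formula f_l
    for l in literals:
        P_l = [a for a in P if lit_value(l, a) == 0]
        f_l = learn_dnf(P_l, N, n, k - 1, r)
        if f_l is not None:
            sub[l] = f_l
    M = greedy_cover(N, list(sub), limit)  # the pseudo-term
    if M is None:
        return None
    g = [M]
    for l in M:
        g.extend(sub[l])
    return g

def eval_dnf(g, a):
    return int(any(all(lit_value(l, a) == 1 for l in term) for term in g))

# Hypothetical target (x_0 AND x_1) OR (NOT x_2 AND x_3) on five variables,
# so r = 4 and x_4 is irrelevant; the full truth table serves as the sample.
target = lambda a: int((a[0] and a[1]) or ((not a[2]) and a[3]))
sample = list(product((0, 1), repeat=5))
P = [a for a in sample if target(a) == 1]
N = [a for a in sample if target(a) == 0]
g = learn_dnf(P, N, 5, 2, 4)
print(g is not None and all(eval_dnf(g, a) == target(a) for a in sample))
```

Each level of the recursion tries all 2n literals, so the sketch makes on the order of $(2n)^{k-1}$ base-case calls; it is meant to show the structure of the algorithm, not to match the running time stated in Theorem 1.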
Full pseudocode, a formal proof of correctness,
and a complexity analysis appear in a longer paper
on this work.
References
[AB92] M. Anthony and N. Biggs. Computational Learning Theory. Cambridge University Press, 1992.
[BEHW87] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Occam’s razor. Information Processing Letters, 24:377-380, 1987.
[BS90] A. Blum and M. Singh. Learning functions of k terms. In Proceedings of the 1990 Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA.
[H88] D. Haussler. Quantifying inductive bias: AI learning algorithms and Valiant’s learning framework. Artificial Intelligence, 36(2):177-222, 1988.
[J74] D. S. Johnson. Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, 9:256-278, 1974.
[K72] R. Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, Plenum Press, New York, 85-103, 1972.
[KV94] M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
[L75] L. Lovász. On the ratio of optimal integral and fractional covers. Discrete Mathematics, 13:383-390, 1975.
[LY93] C. Lund and M. Yannakakis. On the hardness of approximating minimization problems. In Proceedings of the 25th Annual ACM Symposium on Theory of Computing, 1993.
[PV88] L. Pitt and L. G. Valiant. Computational limitations on learning from examples. Journal of the ACM, 35(4):965-984, 1988.
[R87] R. L. Rivest. Learning decision lists. Machine Learning, 2(3):229-246, 1987.
[V84] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.