
From: AAAI Technical Report FS-94-02. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.

Notes on Learning with Irrelevant Attributes in the PAC Model

Aditi Dhagat
Department of EECS
University of Wisconsin
P.O. Box 784
Milwaukee, WI 53201
[email protected]

Lisa Hellerstein*
Department of EECS
Northwestern University
2145 Sheridan Road
Evanston, IL 60208-3118
[email protected]

*Supported in part by NSF grant CCR-92-10957.

Introduction

In these notes, we sketch some of our work on learning with irrelevant attributes in Valiant's PAC model [V84]. In the PAC model, the goal of the learner is to produce an approximately correct hypothesis from random sample data. If the number of relevant attributes in the target function is small, it may be desirable to produce a hypothesis that also depends on only a small number of variables.

Our work is theoretical, but it has real-life analogues. For example, suppose we are trying to determine which combinations of symptoms indicate that a person will develop a certain disease. If only a few symptoms are relevant, we would like our learning algorithm to produce a hypothesis which depends on only a few symptoms. If we use such a hypothesis to predict whether future patients will develop the disease, we will only have to test for a few symptoms (and Hillary Rodham Clinton will be happy).

In the PAC learning model, the object of the learner is to find (with "high" probability) a "good" approximation to a hidden target concept c. The learner is given a sample consisting of labeled examples (a, l) of the target concept (i.e., Boolean function) c. Here a is an element of the domain of c, and l = c(a). The labeled examples are drawn independently according to a fixed but unknown input distribution on the domain. The concepts we consider are expressible by functions of Boolean attributes (variables). The domain elements a are Boolean assignments to the variables {x_1, ..., x_n}. Suppose c is a Boolean function on {x_1, ..., x_n}.
We say that a variable x_i is irrelevant to c if, given any assignment a, the value of c(a) is independent of the value a gives to x_i. More precisely, x_i is irrelevant if the following holds: given any assignment a setting x_i to 1, if a' is obtained from a by changing the value of x_i to 0, then c(a) = c(a'). Note that if c is expressed by a Boolean formula in which x_i does not appear, then x_i is irrelevant to c. If a variable x_i is not irrelevant, then it is relevant.

Haussler [H88] addressed the problem of learning monomials in Valiant's PAC learning model [V84]. A monomial is a conjunction z_1 ∧ z_2 ∧ ... ∧ z_k where each z_i is either a variable, x_j, or the negation of a variable, x̄_j. In particular, Haussler considered the problem of learning monomials in which only a small subset of {x_1, ..., x_n} appear, and hence many of the variables are irrelevant. He developed an algorithm for this problem that is a simple application of the standard greedy set cover approximation algorithm. Given any sample of a monomial on r variables, Haussler's algorithm finds a consistent monomial on r(ln q + 1) variables, where q is the number of negative examples in the sample (a negative example is one in which l = 0).

Good PAC learning algorithms do not necessarily require perfect consistency with the sample. A variation of Haussler's algorithm, suggested by Warmuth, outputs a monomial on O(r log(1/ε)) variables that may be inconsistent with a fraction of size up to ε of the sample (see [KV94], pp. 41-42). The sample complexity of the algorithm is O((1/ε)(log(1/δ) + r log n log(1/ε))). That is, there is a number s of this order such that given a sample of size s, the algorithm, with high probability, will output a good approximation to the hidden monomial. More precisely, it will output a monomial that, with probability at least 1 - δ (i.e., with "high" probability), misclassifies a fraction of size at most ε of the input distribution (i.e., is a "good" approximation).
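Haussler's reduction to greedy set cover can be sketched as follows. This is our own minimal Python illustration, not the paper's pseudocode: the function name, the bit-tuple encoding of examples, and the omission of the r(ln q + 1) size check are our assumptions. Each candidate literal (one satisfied by every positive example) "covers" the negative examples it falsifies, and greedily covering all negatives yields the monomial.

```python
def greedy_monomial(P, N, n):
    """Find a monomial consistent with positive examples P and negative
    examples N over n Boolean variables, via greedy set cover.
    Examples are length-n tuples of 0/1; a literal is a pair (i, s),
    meaning x_i if s == 1 and the negation of x_i if s == 0.
    Returns the monomial as a list of literals, or None on failure."""
    # Candidate literals: those satisfied by every positive example.
    cands = [(i, s) for i in range(n) for s in (0, 1)
             if all(a[i] == s for a in P)]
    # Each candidate "covers" the negatives it falsifies; greedily
    # pick candidates until every negative example is covered.
    uncovered, mono = set(range(len(N))), []
    while uncovered:
        if not cands:
            return None
        best = max(cands, key=lambda l: sum(1 for j in uncovered
                                            if N[j][l[0]] != l[1]))
        gain = {j for j in uncovered if N[j][best[0]] != best[1]}
        if not gain:
            return None  # no candidate falsifies the remaining negatives
        mono.append(best)
        uncovered -= gain
    return mono
```

On a sample of a monomial with r relevant variables, the standard greedy set cover guarantee bounds the output size by r(ln q + 1), where q = |N|.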
The sample complexity of this modified algorithm is better than the sample complexity of Haussler's original algorithm. Note that the size of the monomial output by the algorithm is only a factor log(1/ε) larger than the hidden monomial; it is independent of n, the total number of relevant and irrelevant variables. The sample complexity is linear in r and depends only logarithmically on n.

We extend the above results by presenting a PAC algorithm for learning k-term DNF formulas on a small number of variables. A k-term DNF formula is a disjunction of at most k terms (monomials). Given a sample of a hidden k-term DNF formula on r variables, our algorithm outputs a consistent DNF formula on only O(r^k log^k q) variables, where q is the number of negative examples in the sample. Like Haussler's algorithm, our algorithm can be modified so that it outputs a hypothesis that is inconsistent with a fraction of no more than ε of the sample. The modified algorithm is the basis for the following theorem.

Theorem 1 There is an algorithm that PAC learns the class of k-term DNF formulas with at most r relevant variables from {x_1, ..., x_n}, that outputs a DNF formula with
• O(r^(k-1) log^(k-1)(r/ε)) terms, over
• O(r^k log^k(r/ε)) relevant variables.
The sample complexity of the algorithm is bounded by the polynomial s = O((1/ε)(log(1/δ) + r^k log n log^k(r/ε))), and it runs in time bounded by the polynomial t = O(sn^k). We consider k to be a constant in giving our bounds.

Note that in all the expressions in the above theorem, the dependence on r is polynomial, but the dependence on n is logarithmic. We sketch our algorithm below.

The class of k-term DNF formulas on r relevant variables could be PAC learned by (an algorithm outputting) k-CNF formulas on O(r^k log(1/ε)) variables, using a simple variation of Haussler's monomial algorithm. However, our algorithm has the advantage that it outputs a DNF formula. PAC learning k-term DNF by k-term DNF (or even by 2k-term DNF) is NP-hard [PV88].
Blum and Singh [BS90] showed that it is possible to learn k-term DNF by DNF, but the hypothesis they output has O(n^(k-1)) terms and O(n^k) size. Whether k-term DNF can be learned by o(n^(k-1)) terms remains open.¹ It is not even known whether a 2-term DNF can be learned by an o(n)-term DNF. Our theorem shows that if the target k-term DNF concept is known to have r relevant variables, then it is possible to PAC learn k-term DNF by a DNF which has a number of terms that depends polynomially on r, but only poly-logarithmically on n.

¹Learning k-term DNF by small DNF formulas is closely related to the problem of graph colorability for k-colorable graphs [PV88]. It is known that, unless P = NP, graph colorability cannot be approximated with ratio n^ε (for a particular ε > 0) [LY93]. However, it is not known whether this holds for k-colorable graphs.

In a longer paper on our work, we also present an algorithm for PAC learning Boolean decision lists with k alternations. A Boolean decision list has the form

  if l_1 then return b_1
  else if l_2 then return b_2
  ...
  else if l_m then return b_m
  else return b_(m+1)

where each l_i is either a variable, x_j, or its negation, x̄_j, and each b_i is either 0 or 1. A decision list has k alternations if the sequence {b_1, ..., b_m} changes from 0 to 1 or from 1 to 0 a total of k times (note: this is not equivalent to the k-decision list defined in [R87]). Given a sample of a decision list with k alternations, containing r relevant variables, our algorithm outputs a consistent decision list with k alternations containing O(r^k log^k m) variables, where m is the size of the sample. Again, a modified version of the algorithm produces an output which is not quite consistent. The modified algorithm forms the basis for the following theorem:

Theorem 2 There is an algorithm that PAC learns the class of k-alternation decision lists over {x_1, ..., x_n} with r relevant variables by the class of k-alternation decision lists over
• O(r^k log^k(r/ε)) variables.
The sample complexity of the algorithm is bounded by the polynomial s = O((1/ε)(log(1/δ) + r^k log n log^k(r/ε))), and it runs in time bounded by the polynomial t = O(sn²).

An algorithm to PAC learn k-term DNF in the presence of irrelevant variables

We describe our algorithm for PAC learning k-term DNF in the presence of irrelevant variables. The algorithm takes as input sets P and N of positive and negative examples of a k-term DNF. The algorithm assumes that it knows r, the number of relevant variables in the hidden k-term DNF (if this isn't the case, the algorithm can be repeated with decreasing values of r). It outputs a DNF formula that is consistent with P and N.

Our algorithm uses as a subroutine a well-known greedy approximation algorithm for the set cover problem [J74; L75]. The set cover problem takes as input a collection S of subsets of a finite set U, and asks for a subcollection S' of S such that the union of the subsets in S' is U, and S' contains the smallest possible number of subsets. The problem is NP-complete [K72]. The greedy approximation algorithm constructs a cover of size z(ln|U| + 1), where z is the size of the smallest cover.

To introduce the algorithm, we first discuss the easier problem of learning 2-term DNF formulas. Consider the problem of learning a 2-term DNF formula f = t_1 ∨ t_2 containing r relevant variables. Let P and N be the given sets of positive and negative examples. Let l be a literal that appears in one of the terms, say t_2. Consider any positive example that sets l to 0. That example must satisfy t_1. Thus if P_l is the set of positive examples setting l to 0, then P_l and N are consistent with the formula f' = t_1. Moreover, if we run the approximation algorithm for the minimum consistent monomial problem on P_l and N, then the output will be a monomial f_l of size at most r(ln q + 1), where q = |N|. For every l, we form P_l and N, and we run the approximation algorithm for the minimum consistent monomial problem.
If that algorithm outputs a consistent monomial f_l of size at most r(ln q + 1), then l is designated a candidate (for inclusion in f). Note that every literal that is actually in f will be designated a candidate.

We would like to find one of the terms in f, but as we cannot, we instead generate a pseudo-term of f, as follows. Note that t_1 has the property that every negative example sets at least one literal in t_1 to 0. Thus there exists a set of at most r candidates such that each negative example in N sets at least one of those r literals to 0. Using greedy set cover, we can find a set L' of at most r(ln q + 1) candidates such that each negative example in N sets at least one literal in L' to 0. (More specifically, we run greedy set cover with U equal to N and S equal to the set of subsets S_l, one for each candidate l, where S_l = {x ∈ N | x sets l to 0}.) We form a term M from the literals in L'. We then form a hypothesis g which is the disjunction of M and all f_l such that l is in M.

There is no guarantee that M is a term of f, nor even that it includes any literals in f. Nevertheless, we can show that g is consistent with P and N. By the construction of f_l, no negative example in N satisfies an f_l. Moreover, since each negative example sets at least one literal in M to 0, no negative example satisfies M. Thus g is consistent with N. Consider any positive example a in P. If a sets a literal l in M to 0, then a satisfies f_l and hence g. Otherwise, a sets all literals in M to 1, and hence a satisfies M and therefore g. Thus g is consistent with P as well. The number of literals in g is easily seen to be at most O(r² log² q).

Our algorithm for learning k-term DNF for k > 2 is recursive, and based on the technique just described. The algorithm takes as input k and the sets P and N. Let f = t_1 ∨ ... ∨ t_k be the target concept, which has r relevant variables.
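The 2-term construction above can be sketched in Python as follows. This is our own illustration, not the paper's pseudocode: the bit-tuple encoding and the names are our assumptions, and for brevity we omit the r(ln q + 1) size thresholds on the covers, accepting whatever cover greedy finds.

```python
def learn_two_term_dnf(P, N, n):
    """Sketch of the 2-term DNF construction: find candidate literals l
    (those whose P_l admits a consistent monomial f_l with N), build the
    pseudo-term M by greedy set cover over N, and output
    g = M OR (the f_l for each l in M), as a list of monomials.
    Examples are length-n tuples of 0/1; literals are pairs (i, s)."""
    def greedy_cover(items, falsifies, U):
        # Generic greedy set cover: pick items until all of U is covered.
        uncovered, chosen = set(U), []
        while uncovered:
            if not items:
                return None
            best = max(items, key=lambda it: sum(1 for j in uncovered
                                                 if falsifies(it, j)))
            gain = {j for j in uncovered if falsifies(best, j)}
            if not gain:
                return None
            chosen.append(best)
            uncovered -= gain
        return chosen

    def monomial(pos):
        # Minimum-consistent-monomial approximation on (pos, N):
        # candidates are literals satisfied by every positive example.
        cands = [(i, s) for i in range(n) for s in (0, 1)
                 if all(a[i] == s for a in pos)]
        return greedy_cover(cands, lambda l, j: N[j][l[0]] != l[1],
                            range(len(N)))

    # A literal l is a candidate if P_l (positives setting l to 0)
    # and N admit a consistent monomial f_l.
    f = {}
    for lit in ((i, s) for i in range(n) for s in (0, 1)):
        P_l = [a for a in P if a[lit[0]] != lit[1]]
        m = monomial(P_l)
        if m is not None:
            f[lit] = m
    # Pseudo-term M: a set of candidates such that every negative
    # example sets at least one literal of M to 0.
    M = greedy_cover(list(f), lambda l, j: N[j][l[0]] != l[1],
                     range(len(N)))
    if M is None:
        return None
    return [M] + [f[l] for l in M]
```

The recursive algorithm for general k replaces the inner monomial call with a recursive (k-1)-term call on P_l and N.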
The base case, k = 1, consists of using greedy set cover to try to find a monomial of size at most r(ln q + 1) consistent with P and N. If no such monomial exists, the algorithm returns "Fail". For k > 1, the algorithm takes each literal l, and forms the set P_l of positive examples in P setting l to 0. It recursively runs the algorithm on inputs P_l, N, and k - 1. If the recursive call succeeds and returns a formula f_l (which it will if l actually appears in a term), then l is designated a candidate. After determining all the candidates, the algorithm uses greedy set cover to find a set L' of candidates of size at most r(ln q + 1) such that each negative example sets at least one of the literals in the set to 0. If no such set exists, the algorithm returns "Fail". Otherwise, it forms a hypothesis g consisting of the disjunction of M (the term formed from the literals in L') and the DNF formulas f_l, for all l in M.

Full pseudocode, a formal proof of correctness, and a complexity analysis appear in a longer paper on this work.

References

M. Anthony and N. Biggs. Computational Learning Theory. Cambridge University Press, 1992.

A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Occam's razor. Information Processing Letters, 24:377-380, 1987.

A. Blum and M. Singh. Learning functions of k terms. In Proceedings of the 1990 Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 1990.

D. Haussler. Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36(2):177-222, 1988.

D. S. Johnson. Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, 9:256-278, 1974.

R. Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, Plenum Press, New York, 85-103, 1972.

M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.

L. Lovász. On the ratio of optimal integral and fractional covers. Discrete Mathematics, 13:383-390, 1975.
C. Lund and M. Yannakakis. On the hardness of approximating minimization problems. In Proceedings of the 25th Annual ACM Symposium on Theory of Computing, 1993.

L. Pitt and L. G. Valiant. Computational limitations on learning from examples. Journal of the ACM, 35(4):965-984, 1988.

R. L. Rivest. Learning decision lists. Machine Learning, 2(3):229-246, 1987.

L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984.