From: AAAI-86 Proceedings. Copyright ©1986, AAAI (www.aaai.org). All rights reserved.

INDUCTIVE INFERENCE BY REFINEMENT

P. D. Laird*
Department of Computer Science
Yale University
New Haven, Ct., 06520

* Work funded in part by the National Science Foundation, under Grants MCS8002447 and DCR8404226.

Abstract

A model is presented for the class of inductive inference problems that are solved by refinement algorithms - that is, algorithms that modify a hypothesis by making it more general or more specific in response to examples. The separate effects of the syntax (rule space) and semantics, and the relevant orderings on these, are precisely specified. Relations called refinement operators are defined, one for generalization and one for specialization. General and particular properties of these relations are considered, and algorithm schemas for top-down and bottom-up inference algorithms are given. Finally, difficulties common to refinement algorithms are reviewed.

Introduction

The topic of this paper is the familiar problem of inductive learning: determining a rule from examples. Humans exhibit a striking ability to solve this problem in a variety of situations - to the extent that it is difficult to believe that a separate algorithm is at work in each case. Hence, in addition to the problems of implementing real systems that learn by example, there is the challenge of identifying the fundamental principles that underlie this sort of learning.

As an illustration of the basic ideas, consider the following concept-learning problem (adapted from [Mitchell, 1982]). Objects in this domain have three attributes: size (large, small), color (red, yellow, blue), and shape (triangle, circle). A concept consists of an ordered pair of objects, possibly with some attributes left undetermined. For example, C = {(large ? circle), (small ? ?)} represents the concept "a large circle of any color, and any small object". There is a most-general concept ({(? ? ?), (? ? ?)}) and a large number (144) of most-specific concepts (both objects fully specified). Examples, or training instances, can be positive or negative: {(large blue circle), (small blue triangle)} is a positive example of the concept C above, whereas {(large red triangle), (large blue circle)} is a negative example. If the current hypothesis excludes a positive example, the inference procedure must generalize it; and if it includes a negative example, the procedure must make it more specific. Every domain has rules for making a hypothesis more or less general; here, a concept can be generalized by changing an attribute from a specific value to '?', or specialized by the inverse operation.

The essential features of this simple illustration apply to many inductive learning problems in a variety of domains: formal languages and automata (e.g., [Angluin, 1982], [Crespi-Reghizzi, 1972]); programming languages (e.g., [Hardy, 1975], [Shapiro, 1982], [Summers, 1977]); functions and sequences ([Hunt et al., 1966], [Langley, 1980]); propositional and predicate logic ([Michalski, 1975], [Shapiro, 1981], [Valiant, 1984]); and a variety of representations specific to a particular domain (e.g., [Feigenbaum, 1963], [Winston, 1975]).

From the experiences of many researchers (see, for example, [Angluin and Smith, 1983], [Banerji, 1985], [Cohen, 1982], [Michalski, 1983] for summaries), a number of general guidelines have been suggested:

- Define a space of examples and a space of rules rich enough to explain any set of examples.
- Given some examples and a set of possible hypotheses, generalize the hypotheses that fail to explain positive examples, and specialize hypotheses that imply negative examples.
- If possible, represent examples in the same language used to express the rules.

Our goal is to present a formal model to unify many of the ideas common to these domains.
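To ground the illustration, here is a minimal sketch (my own, not from the paper) of the size-color-shape domain: a concept is taken as an ordered pair of attribute patterns, '?' matches anything, and a hypothesis that misses a positive example is minimally generalized by replacing the offending attribute values with '?'. All function names are illustrative.

```python
# Sketch of the introductory concept-learning domain.  A concept is an ordered
# pair of (size, color, shape) patterns; '?' is the don't-care token.
def matches(pattern, obj):
    return all(p in ("?", o) for p, o in zip(pattern, obj))

def covers(concept, instance):
    """Does the concept (pair of patterns) include the instance (pair of objects)?"""
    return all(matches(p, o) for p, o in zip(concept, instance))

def generalize(concept, positive):
    """Minimal generalization: replace each mismatched attribute by '?'."""
    return tuple(tuple(p if p in ("?", o) else "?" for p, o in zip(pat, obj))
                 for pat, obj in zip(concept, positive))

C = (("large", "?", "circle"), ("small", "?", "?"))
pos = (("large", "blue", "circle"), ("small", "blue", "triangle"))
neg = (("large", "red", "triangle"), ("large", "blue", "circle"))
assert covers(C, pos) and not covers(C, neg)

# A hypothesis that excludes a positive example must be generalized:
H = (("large", "red", "circle"), ("small", "red", "triangle"))
print(generalize(H, pos))   # (('large', '?', 'circle'), ('small', '?', 'triangle'))
```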
The value of such a formalism is that the essential features of the inductive component of a projected application can be identified quickly, and basic algorithms constructed, without the need to rediscover these ideas from first principles. Another advantage is that the abstract properties and limitations common to algorithms based on this model can be identified and studied without reference to the details of a particular application.

This report is necessarily brief, with only the outlines of the principal concepts given, plus examples to illustrate their application. More details, examples, and proofs are available in the full report ([Laird, 1985]).

Inductive Inference Problems

Definition 1. An inductive inference problem has six components:

- A partially ordered set (D, ≥), called the semantic domain.
- A set E of expressions over a finitely presented algebra, called the syntactic domain.
- A mapping h: E → D such that every element d in D is h(e) for some expression e in E.
- A designated object d0 of D, called the target object.
- An oracle, EX, for "examples" of d0, in the form of signed expressions in E. If EX() returns +e, then d0 ≥ h(e), and if EX() returns -e, then d0 ≱ h(e).
- An oracle (≥?) for the partial order, such that (e1 ≥ e2?) returns 1 if h(e1) ≥ h(e2), and 0 otherwise.

The examples below will help make this definition seem less abstract. It is more general than most definitions of inductive inference, in that target objects need not be subsets of some set; instead, they are simply elements of a partial ordering. Rules are expressions over some algebra which is expressive enough to name every possible target object. The mapping h is the association between rules and the objects they denote. Note that h may map more than one expression to the same semantic element; if h(e1) = h(e2), then e1 and e2 are called h-equivalent rules. There may, of course, be different syntactical representations of one semantic domain (grammars, automata, logical axioms, etc.); according to this model, the problem changes when a new syntax E is adopted.

Examples are elements of the same set E of expressions: i.e., every expression in E is potentially an example - positive in case its semantic representation is no greater than the target, and negative otherwise. In practice, the set of examples is often limited to a subset of the expressions (see below). EX represents the mechanism which produces examples of the target; in actuality, examples may come from a teacher, a random source, a query-answering source, or combinations of these. Note that EX depends on the target.

The oracle (≥?) serves to abstract the aspect of the problem concerned with testing whether an expression implies an example. In practice, the complexity of this problem ranges from easy to unsolvable. By oracularizing it we are choosing to ignore the complexity of this problem while we study the abstract properties of inductive inference.

Example 1. Let X = {x1, ..., xt} be a finite set, and D = 2^X, the set of subsets of X. For example, X might denote a set of automobile attributes (sedan, convertible, front-wheel drive, etc.), while an element of D specifies those attributes possessed by a particular model. Let D be partially ordered by ⊇, the containment relation. There are many possible languages for representing D, such as the conventional one for set theory. A more algebraic language is a Boolean algebra over the elements x1, ..., xt, with elements of D represented by monomials of degree t (minterms). Thus the empty set is represented by x1'x2'···xt' (x' denotes the complement of x), and {x1} = h(x1 x2'···xt'). In this case h is a bijection between minterms and elements of D. If m1 and m2 are minterms, then m1 is a positive example of m2 iff every uncomplemented variable in m1 occurs uncomplemented in m2.

Example 2. Let D be the class of partial recursive functions mapping integers to integers. If f1 and f2 are functions in D, define f1 ≥ f2 iff for every integer x such that f2(x) is defined, f2(x) = f1(x). A convenient language for representing the functions in D is the subset of LISP expressions defining functions from integers to integers. If P is such a program, then h(P) is the partial function computed by P. Consider the function f(x) = x². A positive example of f is the program (LAMBDA (X) (COND ((= X 2) 4))). Intuitively, this example states that f(2) = 4 without giving any other values of f. The problem of deciding for two arbitrary programs P1 and P2 whether P1 ≥ P2 is, of course, recursively unsolvable.
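To make Definition 1 concrete, the following small sketch (mine, not from the paper) instantiates its six components for the minterm domain of Example 1. The encoding of a minterm as the set of its uncomplemented variables, and all names, are my own choices; the LISP domain of Example 2 could be set up analogously, except that its (≥?) oracle is not computable.

```python
# Sketch: the six components of Definition 1, instantiated for Example 1.
# A minterm is encoded as the frozenset of its uncomplemented variables;
# names (X, E, h, geq, EX, target) are illustrative, not the paper's.
from itertools import combinations

X = ("x1", "x2", "x3")                      # the finite set underlying D = 2^X
E = [frozenset(c) for r in range(len(X) + 1)
     for c in combinations(X, r)]           # syntactic domain: all 2^t minterms

def h(m):                                   # semantic map: minterm -> subset of X
    return set(m)

def geq(e1, e2):                            # the (>=?) oracle: is h(e1) a superset of h(e2)?
    return h(e1) >= h(e2)

target = frozenset({"x1", "x3"})            # an expression naming the target object d0

def EX():                                   # example oracle: stream signed expressions
    for e in E:
        yield ("+" if geq(target, e) else "-", e)

for sign, e in list(EX())[:4]:
    print(sign, sorted(e))                  # e.g.  + [] / + ['x1'] / - ['x2'] / + ['x3']
```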
The general inductive inference problem allows any expression in E to be an example. More often we are limited to a subset of E for examples (e.g., in Example 2 above, we may be given only programs defined on a single integer rather than arbitrary programs). But what property guarantees that a subset of E is sufficient to identify a target uniquely?

Definition 2. Let S be a set of examples of d0. An expression e is said to agree with S if for every positive example e+ in S, h(e) ≥ h(e+), and for every negative example e- in S, h(e) ≱ h(e-).

Definition 3. A sufficient set of examples of d0 is a signed subset S of E with the property that all expressions that agree with S are h-equivalent and are mapped by h to d0.

Example 3. In Example 1 above, a sufficient set of examples for any target can be formed from only the t minterms with exactly one uncomplemented variable. In Example 2, a sufficient set of examples can be constructed from programs of the form (LAMBDA (X) (COND ((= X i) j))), where i and j are integer constants.

Example 4. In many concept-learning problems, objects possess a subset of some group of t binary attributes, and a concept is a subset of the set of all possible distinct objects. A "ball", for example, might denote the concept consisting of all objects with the "spherical?" attribute, regardless of other attributes ("red?", "wooden?", etc.). As a formal expression of this domain, let {x1, ..., xt} be Boolean variables, and D the set of sets of assignments of values (0 or 1) to all of the variables, partially ordered by ⊇. For syntax we may take the set of Boolean expressions; h maps an expression to the set of satisfying assignments. Expression e1 is an example of e2 iff "e1 → e2" is a tautology (→ denotes Boolean implication), since then any assignment satisfying e1 must also satisfy e2. A sufficient set of examples for any expression can be formed from the set of minterms, since these represent a single assignment.

Finally, note that for every inductive inference problem there is a dual problem, differing only in that D is partially ordered by ≤ (rather than ≥). Then e1 is a positive example of e2 if e2 ≤ e1, and a negative example otherwise.
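As a worked illustration of Definitions 2 and 3 (and of Example 3), the following sketch, under the same minterm encoding as in the earlier sketch, checks that the t singleton minterms, signed against a target, form a sufficient set: the only expression agreeing with them is the target itself. The helper names are mine.

```python
# Sketch of Definitions 2-3 and Example 3 for the minterm domain: the t signed
# singleton minterms are a sufficient set of examples for any target.
from itertools import combinations

X = ("x1", "x2", "x3")
E = [frozenset(c) for r in range(len(X) + 1) for c in combinations(X, r)]
h = set                                          # h(m) = set of uncomplemented vars

def agrees(e, examples):
    """Definition 2: e implies every positive example and no negative one."""
    return all((h(e) >= h(x)) == (sign == "+") for sign, x in examples)

target = frozenset({"x1", "x3"})
S = [("+" if v in target else "-", frozenset({v})) for v in X]   # Example 3

consistent = [e for e in E if agrees(e, S)]
assert consistent == [target]                    # only the target agrees with S
print("sufficient set:", [(s, sorted(m)) for s, m in S])
```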
Refinements

In most applications, the mapping h: E → D is not just an unstructured assignment of expressions to objects. Usually there is an ordering ⪰ on the expressions that is closely related to the underlying semantics. For example, referring again to the size-color-shape example in the introduction, we see that the syntactic operation of replacing an attribute ("red") by the don't-care token ('?') corresponds semantically to replacing the set of expressed objects by a larger set.

The ordering ⪰ may only be a quasi-ordering (reflexive and transitive but not antisymmetric). But we can still use it to advantage in computing generalizations and specializations of hypotheses, provided it has three properties: an order-homomorphism property with respect to h; a computational sub-relation called a refinement; and a completeness property. These we shall now define.

Definition 4. Let ⪰ be an ordering of E. The mapping h: E → D is said to be an order-homomorphism if, for all e1 and e2 in E such that e1 ⪰ e2, h(e1) ≥ h(e2).

Example 5. Referring back to Example 1, let m2 ⪰ m1 if every uncomplemented variable in m1 also occurs uncomplemented in m2. Then h is an order-homomorphism. In Example 2, it is not clear how to define P2 ⪰ P1 on arbitrary LISP programs so as to form an order-homomorphism. In [Summers, 1977], this problem is solved by restricting the class of LISP programs to have a very specific form.

Example 6. Because of the expressiveness of predicate calculus, many systems use a first-order language L (or a subset thereof) as the syntactic component of the inference problem. But what is the semantic component, and how is it ordered? Example 4 is the analogous case for propositional logic, where D consisted of sets of assignments, ordered by ⊇. With first-order logic, Herbrand models (i.e., sets of variable-free atomic formulas) take the place of assignments, and D consists of classes of (first-order definable) Herbrand models, ordered by ⊇. The sentence ∀x (red(x) ∨ ¬large(x)) designates the class of models in which everything that is large is also red. A syntactic ordering is as follows: given sentences φ1 and φ2 in E, define φ2 ⪰ φ1 iff ⊨ φ1 → φ2 (where → indicates implication). It is easy to see that h is an order-homomorphism: if φ1 implies φ2, then any model of φ1 is also a model of φ2, and hence the models of φ1 are a subset of those of φ2.
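Returning to the minterm encoding of the earlier sketches, the following small check illustrates Definition 4 and the ordering of Example 5: the syntactic relation m2 ⪰ m1 and the semantic relation h(m2) ⊇ h(m1) coincide here (h is a bijection), so h is trivially an order-homomorphism. The names and the exhaustive-test style are my own.

```python
# Sketch of Definition 4 / Example 5 on the minterm encoding: the syntactic
# ordering ("every uncomplemented variable of m1 is uncomplemented in m2")
# maps under h onto set containment, so h is an order-homomorphism.
from itertools import combinations, product

X = ("x1", "x2", "x3")
E = [frozenset(c) for r in range(len(X) + 1) for c in combinations(X, r)]
h = set

def syntactically_geq(m2, m1):       # the ordering of Example 5 on expressions
    return m1 <= m2

def semantically_geq(d2, d1):        # the ordering on D (set containment)
    return d1 <= d2

# Definition 4: e1 >= e2 must imply h(e1) >= h(e2); exhaustive check for small t.
assert all(semantically_geq(h(e1), h(e2))
           for e1, e2 in product(E, repeat=2) if syntactically_geq(e1, e2))
print("h is an order-homomorphism on", len(E), "minterms")
```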
The importance of ⪰ to inductive inference is as follows. Suppose a hypothesized rule e is found to be too general, in the sense that there exists a negative example e- such that h(e) ≥ h(e-). Then (assuming h is an order-homomorphism) any new hypothesis e' such that e' ⪰ e will also be too general, and hence need not be considered. Similarly, if e is too specific, then any hypothesis e' such that e ⪰ e' can be eliminated from consideration.

In order to take advantage of the efficiency induced by the syntactic ordering ⪰, we need the means to take an expression and obtain from it expressions that are more general or more specific. This leads to the notion of a refinement relation.

Definition 5. An upward refinement γ is a recursively enumerable (r.e.) binary relation on E such that (i) γ* (the reflexive-transitive closure of γ) is the relation ⪰; and (ii) for all e1, e2 in E, if (e1, e2) ∈ γ then h(e1) ≥ h(e2). The notation γ(e) denotes the set of expressions e1 such that (e1, e) ∈ γ.

There is a dual definition of a downward refinement ρ for an ordering ⪯: ρ* = ⪯, and if (e1, e2) ∈ ρ then h(e1) ≤ h(e2). Nearly everything true of upward refinements has a dual for downward refinements, but to save space we shall omit dual statements.

The r.e. condition on refinements means that they can be effectively computed. Thus if e is found to be too specific, an inference algorithm can compute the set γ(e), and γ of each of these expressions, and so on, in order to find a more general hypothesis. We would like to know that, by continuing to refine e upward in this way, we will eventually obtain an expression for every object d more general than h(e). This motivates the completeness property of refinements.

Definition 6. An upward refinement γ is said to be complete for e ∈ E if h(γ*(e)) = {d | d ≥ h(e)}. If γ is complete for all e ∈ E then γ is said to be complete.

Example 7. In the domain of Example 1, let γ(m) be the set of minterms m' obtained from m by uncomplementing exactly one of the complemented attributes. Thus γ(x1'x2'···xt') = {(x1 x2'···xt'), (x1' x2 x3'···xt'), ..., (x1'···x(t-1)' xt)}, and γ(x1 x2 ··· xt) = ∅. It is easily seen that γ is a complete upward refinement for minterms. A complete downward refinement for minterms computes the set of minterms obtained from m by complementing exactly one of the uncomplemented attributes.

Example 8. Let E be the set of first-order clauses (i.e., disjunctions of atomic literals or their negations) with only universal quantification, assuming some fixed language L. An upward refinement for E (with respect to the ordering ⪰ of Example 6) is as follows ([Shapiro, 1981]). Let C be a clause. γ(C) is the set of clauses obtained from C by applying exactly one of the following operations:

- Unify two distinct variables x and y in C (replace all occurrences of one by the other).
- Substitute for every occurrence of a variable x a most-general term (i.e., function call or constant) with fresh variables. (For example, replace every x by f(x1), where f is a function symbol in C and x1 does not occur in C.)
- Disjoin a most-general literal with fresh variables (i.e., p(x1) or ¬p(x1), where p is a predicate symbol and x1 does not occur in C).

For example, let r(x, y) stand for the relation x is-a-blood-relative-of y, f(x) for the function the-father-of x, and m(x) for the function the-mother-of x. Let C be the clause r(x, f(y)) → r(x, y), meaning that if someone is related to a person's father, then he is related to that person. The following clauses are all in γ(C):

- r(x, f(x)) → r(x, x)
- r(m(x1), f(y)) → r(m(x1), y)
- r(x, f(y)) → (r(x, y) ∨ r(x1, x2))

Each of the derived clauses is easier to satisfy and hence has more models.

More examples of refinements over various domains are given in [Laird, 1985]. The task of constructing a refinement can be tricky, because one must ensure that all useful generalizations or specializations have been included. But given the formal definition, it is basically a problem in algebra, rather than a heuristic problem, as has usually been the case for most applications. In essence, the refinement is a formal expression of the "production rules" or "generalization rules" found in many implementations (e.g., [Michalski, 1980], [Mitchell, 1977]).
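Here is a sketch of the upward refinement of Example 7 and its downward dual, for the minterm encoding used in the earlier sketches, together with a check of the completeness property of Definition 6 on a small instance. The function names are mine.

```python
# Sketch: Example 7's refinements for minterms (encoded as sets of
# uncomplemented variables).  gamma_up uncomplements one complemented variable;
# gamma_down complements one uncomplemented variable.
from itertools import combinations

X = ("x1", "x2", "x3")
E = [frozenset(c) for r in range(len(X) + 1) for c in combinations(X, r)]

def gamma_up(m):
    return [m | {v} for v in X if v not in m]

def gamma_down(m):
    return [m - {v} for v in m]

bottom, top = frozenset(), frozenset(X)        # x1'x2'x3'  and  x1 x2 x3

print([sorted(s) for s in gamma_up(bottom)])   # the three singleton generalizations
print(gamma_up(top))                           # [] : nothing is more general than top

# Completeness (Definition 6): iterating gamma_up from m reaches exactly the
# minterms whose denotation contains h(m).
def upward_closure(m):
    seen, frontier = {m}, [m]
    while frontier:
        frontier = [x for e in frontier for x in gamma_up(e) if x not in seen]
        seen.update(frontier)
    return seen

assert upward_closure(frozenset({"x1"})) == {m for m in E if "x1" in m}
```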
Below we sketch a simple "bottom-up" algorithm for inductive inference using an upward refinement γ. For simplicity we assume:

1. γ(e) is finite for all e. (γ is then said to be locally finite.)
2. There is an expression e_min in E such that e ⪰ e_min for all e in E.
3. γ is complete for e_min.

The algorithm repeatedly calls on EX and tests the current hypothesis against the resulting example. If the hypothesis is too specific, it refines it, placing the more general expressions onto a queue and taking a new hypothesis from the front of the queue. If the hypothesis is too general, it discards it (with no refinement) and takes a new one from the queue.

ALGORITHM UP-INFER:
  Initialize: H ← e_min; QUEUE ← empty(); EXAMPLES ← empty().
  Do Forever:
    EXAMPLES ← EXAMPLES ∪ {EX()}.
    While H disagrees with some example in EXAMPLES
        (using the (≥?) oracle: H disagrees if H ≥ some negative example or H ≱ some positive example):
      If H ≥ no negative example (H is merely too specific), add γ(H) to QUEUE.
      H ← front(QUEUE).

It can be shown that the algorithm will eventually converge to a correct hypothesis, provided EX presents a sufficient set of examples for the target.
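To make the schema concrete, here is a runnable sketch of UP-INFER for the minterm domain (Example 1, with the refinement of Example 7). EX is simulated by streaming a sufficient set of signed examples in the sense of Example 3; the encoding and all names are mine, and the paper gives only the abstract schema.

```python
# Sketch of the UP-INFER schema on the minterm domain; the queue discipline
# follows the algorithm above.  Encoding and names are illustrative.
from collections import deque

X = ("x1", "x2", "x3", "x4")
h = set                                              # h(m) = set of uncomplemented vars

def gamma(m):                                        # upward refinement (Example 7)
    return [m | {v} for v in X if v not in m]

def agrees(hyp, examples):                           # Definition 2, via the (>=?) test
    return all((h(hyp) >= h(e)) == (sign == "+") for sign, e in examples)

def up_infer(example_stream):
    H, queue, examples = frozenset(), deque(), []    # H starts at e_min (all complemented)
    for signed in example_stream:
        examples.append(signed)
        while not agrees(H, examples):
            # Merely too specific (implies no negative example): refine upward.
            if all(not (h(H) >= h(e)) for sign, e in examples if sign == "-"):
                queue.extend(gamma(H))
            H = queue.popleft()                      # discard H, try the next candidate
        yield H                                      # hypothesis after this example

target = frozenset({"x1", "x3", "x4"})
stream = [("+" if v in target else "-", frozenset({v})) for v in X]   # sufficient set
final = None
for final in up_infer(stream):
    print(sorted(final))
print("converged to target:", final == target)
```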
By duality, we can construct a top-down algorithm using a downward refinement ρ and a most-general expression e_max. Other algorithms are also possible, depending on the properties of the domain and the refinements. Most inductive inference algorithms in the literature are either top-down or bottom-up ([Mitchell, 1982] suggests using both in parallel). And for some domains, one direction seems advantageous over the other (conjunctive logical domains, for example, seem to prefer generalization to specialization). This directional asymmetry seems to occur mainly when the refinement is locally finite in one direction but not in the other. Note that this is a syntactic property, not a semantic one: regular sets of strings, for example, are easier to infer bottom-up when the rules are expressed as automata ([Biermann and Feldman, 1972]) but top-down when the rules are expressed as logical axioms ([Shapiro, 1981]).

Recently several researchers have been looking for efficient inference algorithms that yield (with high probability) rules whose "error" is arbitrarily small, as measured by the probability distribution governing the presentation of examples ([Valiant, 1984], [Valiant, 1985], [Blumer et al., 1986]). In many of these algorithms a refinement operator is clearly being employed; but instead of generating all the refinements γ(e), the examples are used to reduce the set of possibilities - e.g., γ(e, z) is computed using the example z, yielding more general expressions that are consistent with z.

Finally, it is worth observing how refinement algorithms handle the so-called Disjunction Problem. In the context of classical concept-learning, this refers to the problem of forming a "reasonable" generalization from examples in a domain that includes a disjunction operation. The trivial generalization, consisting of the disjunction of all the positive examples, is usually unsuitable since it will never converge to a hypothesis representing an infinite set. On the other hand, it is undesirable to eliminate appropriate generalizations as possibilities, since one of them might lead to the correct generalization rule. Since refinement algorithms such as the one above apply the operator γ to hypotheses in the order in which they are discarded, the minimal generalization (adding only the one element to the set) is not necessarily the first one tried. For example, suppose the domain is the class of regular sets of strings over the alphabet {0, 1} and examples are strings in and not in the target set. If the current hypothesis is represented by a regular expression R, and a string w1 is presented that is not included in the set denoted by R, a refinement algorithm will generalize by applying γ to R, producing a set of new expressions to be tried in turn as hypotheses. Among these are expressions which extend R to R + w1; but other expressions, such as R*, will also be constructed and held on the queue in order. If another positive example w2 is presented, R + w1 will be discarded, but R* will be considered before the refinements of R + w1 (including the trivial one R + w1 + w2). It can be shown that the algorithm will converge to a correct expression, whether or not the target is a finite set of strings.

Limitations of the Refinement Approach

The induction-by-refinement model is not expected to yield efficient algorithms directly, since it is too general to take advantage of specific properties of the domain. Instead, the primary value of the model is the way it clarifies the important roles played by the semantic and syntactic orderings, both in the definition of refinement operators and in computing generalizations and specializations of hypotheses.

There are many domains in which the partial order is too linear or too "flat" to be of much use in searching for hypotheses. Consider, for example, the problem of finding an arithmetic recursion relation of the form s_n = f(s_{n-1}) to explain a sequence of integers. We might, for instance, try the hypothesis s_n = s_{n-1} + 5 and find that it explains only one integer in the sequence. At this point, the "less defined than or equal to" ordering used in Example 2 is no more useful for finding a more general function than a simple generate-and-test approach.

Refinement algorithms have generally performed poorly when the examples are subject to "noise" (e.g., [Buchanan and Mitchell, 1978]). They also tend to require that all examples be stored, so that later refinements can be tested to avoid over-generalization or -specialization (e.g., [Shapiro, 1982]). These two limitations are inevitably related: for, a procedure which is tolerant of faulty examples cannot expect to find a hypothesis consistent with every example seen so far, and hence must be selective about the examples it chooses to retain. It is interesting to note that nearly all refinement algorithms in the literature refine in only one direction (up or down). Consequently these algorithms cannot recover if they over-refine in response to a faulty example. By contrast, an algorithm which can refine upward and downward has the potential for correcting an over-generalization resulting from a false positive example when subsequent examples so indicate (e.g., [Shapiro, 1982]).

Finally - and most serious - the refinement technique relies on a fixed algebraic language, without suggesting any way to incorporate new "terms" or "concepts" into the language. In the learning of geometric shapes, for example, we could in principle define complex patterns in terms of the elementary relations of the language E (edge, triangle, box, inside), but the description would be too complex to find by searching through the rule space. By contrast, a suitable set of higher-level concepts (e.g., circle, above, etc.) could make the refinement path to a successful rule short enough to find by searching. But I am unaware of any general technique for discovering such new terms (other than having a friendly "teacher" present them explicitly). A successful model of this process would be a significant advance in the study of inductive learning.

Acknowledgements

I am especially grateful to Dana Angluin for many helpful discussions of this work. Thanks also to Takeshi Shinohara, whose careful reading of the original report identified some errors and improved the exposition.

References

[1] Angluin, D. Inference of reversible languages. J. ACM 29:3 (1982) 741-765.
[2] Angluin, D. and C. H. Smith. Inductive inference: theory and methods. Computing Surveys 15:3 (1983) 237-269.
[3] Banerji, R. The logic of learning. In Advances in Computers 24, M. Yovits, ed. Orlando: Academic Press, 1985, pp. 177-216.
[4] Biermann, A. W. and J. Feldman. On the synthesis of finite-state machines from samples of their behavior. IEEE Trans. Comput. C-21:6 (1972) 592-597.
[5] Blumer, A., A. Ehrenfeucht, D. Haussler, and M. Warmuth. Classifying learnable geometric concepts with the Vapnik-Chervonenkis dimension. Proc. 18th ACM Symp. Theory of Comp., May 1986.
[6] Buchanan, B. G. and T. M. Mitchell. Model-directed learning of production rules. In Pattern-directed inference systems, D. A. Waterman and F. Hayes-Roth, eds. New York: Academic Press, 1978, pp. 297-312.
[7] Cohen, P. R. and E. A. Feigenbaum, eds. The handbook of artificial intelligence, Vol. III. Los Altos: William Kaufmann, Inc., 1982.
[8] Crespi-Reghizzi, S. An effective model for grammar inference. In Information Processing 71, B. Gilchrist, ed. New York: Elsevier North-Holland, 1972, pp. 524-529.
[9] Feigenbaum, E. A. The simulation of verbal learning behavior. In Computers and Thought, E. A. Feigenbaum and J. Feldman, eds. New York: McGraw-Hill, 1963.
[10] Gold, E. M. Language identification in the limit. Information and Control 10 (1967) 447-474.
[11] Hardy, S. Synthesis of LISP programs. Proc. IJCAI-75, 1975, pp. 268-273.
[12] Hunt, E. B., J. Marin, and P. J. Stone. Experiments in Induction. New York: Academic Press, 1966.
[13] Laird, P. D. Inductive inference by refinement. Tech. Rpt. 376, Department of Computer Science, Yale University, New Haven, Ct., 1986.
[14] Langley, P. W. Descriptive discovery processes: experiments in Baconian science. Tech. Rep. CS-80-121, Computer Science Department, Carnegie-Mellon University, 1980.
[15] Michalski, R. Variable-valued logic and its applications to pattern recognition and machine learning. In Computer Science and multiple-valued logic: theory and applications, D. Rine, ed. Amsterdam: North-Holland, pp. 506-534.
[16] Michalski, R. Pattern recognition as rule-guided inductive inference. IEEE Trans. Pat. Anal. and Mach. Intel. PAMI-2:4 (1980) 349-361.
[17] Michalski, R. Theory and methodology of inductive learning. Artificial Intelligence 20:2 (1983) 111-161.
[18] Mitchell, T. M. Version spaces: a candidate elimination approach to rule learning. Proc. IJCAI-77, pp. 305-310.
[19] Mitchell, T. M. Generalization as search. Artificial Intelligence 18:2 (1982) 203-226.
[20] Shapiro, E. Y. Inductive inference of theories from facts. Tech. Rep. 192, Department of Computer Science, Yale University, New Haven, Ct., 1981.
[21] Shapiro, E. Y. Algorithmic program debugging. Ph.D. dissertation, Computer Science Department, Yale University, New Haven, Ct., 1982. Published by M.I.T. Press.
[22] Summers, P. D. A methodology for LISP program construction from examples. J. ACM 24:1 (1977) 161-175.
[23] Valiant, L. G. A theory of the learnable. C. ACM 27:11 (1984) 1134-1142.
[24] Valiant, L. G. Learning disjunctions of conjunctions. Proc. IJCAI-85, pp. 560-566.
[25] Winston, P. H. Learning structural descriptions from examples. In Psychology of Computer Vision, P. H. Winston, ed. New York: McGraw-Hill, 1975.