From: AAAI-87 Proceedings. Copyright ©1987, AAAI (www.aaai.org). All rights reserved.

Learning and Representation Change

Jeffrey C. Schlimmer*
Department of Information and Computer Science
University of California, Irvine 92717
schlimmer@ics.uci.edu

Abstract

To remain effective without human interaction, intelligent systems must be able to adapt to their environment. One useful form of adaptation is to incrementally form concepts from examples for the purposes of inference and problem-solving. A number of systems have been constructed for this task, yet their capability is limited by the language used to represent concepts. This paper presents an extension to the concept acquisition system STAGGER that allows it to utilize continuously valued attributes. The combination of methods employed is able to dynamically acquire appropriate representations, thereby minimizing the impact of initial representational bias decisions. Of additional interest is the distinction between the computational flavor of the learning methods, for one is similar to connectionist approaches while the other two are of a more symbolic nature.

I. Introduction

Consider the task of constructing a concept description given a series of examples and non-examples. This task has been addressed by a number of learning systems, yet many have limited capability due to inflexibility inherent in the concept representation language. If the language is too restrictive, there will be some concepts which cannot be represented or learned. The restriction imposed by the representation language is necessary, though, for without it the learning method could do no better than to guess randomly at the concept's definition (Utgoff & Mitchell, 1982). To alleviate this bind between flexibility and tractability, a system might modify the underlying representational language in some way or another, either increasing the language to accommodate more possible concepts or reducing it to improve the possibility of finding an appropriate concept description.

This paper describes a two part extension to a concept acquisition system called STAGGER (Schlimmer & Granger, 1986). First, a new method is added that discretizes continuously valued attributes. Secondly, by combining this new method with the existing ones, STAGGER is able to refine a distributed concept description and to overcome limitations inherent in its initial concept language. After briefly describing some related work, this paper describes each of the three learning methods and then demonstrates their interaction.

*This work was supported by grants ONR N00014-85-K-0854, ONR N00014-84-K-0391, ARI MDA903-85-C-0324, and NSF IST-85-12419.

II. Related Work

A number of researchers have studied the problem of utilizing numerically valued information in symbolic concept learning. For example, Michalski (1983) presents the closing interval generalization rule. It specifies that if two values are found in positive examples, one may assume that all of the values between them will be positive as well. This mapping from continuous to discrete values is similar to the type of approach used by Lebowitz's (1985) UNIMEM, which partitions real values in both a generalization-based and a data-driven manner. In the first, clusters of examples formed on the basis of discrete attributes partition a real range by implicitly grouping a subset of the numeric values present. The latter, data-driven technique searches for gaps indicated by the distribution of real values across objects.
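To make the data-driven technique concrete, here is a minimal sketch of gap-based partitioning; the function name, the fixed gap threshold, and the midpoint choice are illustrative assumptions rather than details of UNIMEM itself.

    def gap_partition(values, min_gap=3.0):
        """Split a real-valued range at unusually large gaps between
        observed values (illustrative of data-driven partitioning).

        values: observed attribute values across objects.
        min_gap: hypothetical threshold; a real system might derive it
                 from the value distribution instead of fixing it.
        Returns cut points that break the range into intervals.
        """
        ordered = sorted(values)
        cuts = []
        for lo, hi in zip(ordered, ordered[1:]):
            if hi - lo >= min_gap:
                cuts.append((lo + hi) / 2.0)  # cut in the middle of the gap
        return cuts

    # e.g. gap_partition([1, 2, 2.5, 9, 10, 11, 18]) -> [5.75, 14.5]

As noted below, such a method fails when the values of the two classes overlap without gaps, which motivates STAGGER's class-driven measure.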
Quinlan's (1986) ID3 system forms a number of competing intervals centered around potential pairs of splits in the real-value range. ID3 then considers these intervals as possible decision tree roots by interpreting them as binary valued attributes. A distinctly different approach for handling numeric information is taken by Bradshaw (1985) in his speech understanding system. Instead of attempting to map the continuous information representing a verbal utterance into some set of symbolic values, his system retains a set of attribute averages for each word concept. Rendell's (1986) PLS family of concept induction systems incorporates both partitioning and averaging approaches. In some instances concepts are represented as "rectangles" or value ranges, while in other cases concepts appear to be described in terms of their central values.

Along representational lines, Utgoff and Mitchell (1982) were perhaps the first to address the issue of constrictive representational assumptions. In this and subsequent work (Utgoff, 1986) they develop a method which explicitly identifies shortcomings in the descriptive language and triggers procedures for relaxing the components of an ill-fitting initial representational language. STAGGER's numeric learning method belongs in the partitioning class, for it divides continuous real values into a dynamically determined number of discrete ones. Perhaps more importantly, the interaction between the three learning methods alleviates the impact of initial representational decisions in a natural manner; each assists the others while performing its own task.

III. Learning Methods

STAGGER uses three interacting learning components and represents concepts as a set of weighted, symbolic description pieces. One of the learning methods adjusts the weights, another adds new Boolean pieces, and the third adds new pieces corresponding to an aggregation of real values into a few discrete ones. The interaction between these methods may be viewed as a form of representational learning, for they exert influence on each other by changing the substrate from which induction proceeds.

A. Concept representation

Concepts are represented in STAGGER as a set of dually-weighted, symbolic pieces. Each element of the concept description may be a single attribute-value pair, a range of acceptable values for a real-valued attribute, or a Boolean combination. Figure 1 depicts a typical concept description for size=medium & color=red. Each descriptive element is dually weighted to capture positive and negative implication. One weight formalizes the element's sufficiency (solid line), or matched => example, and the other represents its necessity (dashed line), or ¬matched => ¬example. These weights are based on the logical sufficiency (LS) and logical necessity (LN) measures used in Prospector (Duda, Gaschnig, & Hart, 1979):

    LS = \frac{p(matched \mid example)}{p(matched \mid \neg example)}, \qquad LN = \frac{p(\neg matched \mid example)}{p(\neg matched \mid \neg example)}    (1)

[Figure 1: A typical concept description for size=medium & color=red, with sufficiency (solid) and necessity (dashed) weights on each element.]

Following the method used in (Duda et al., 1979), given a new example, the expectation that it is a positive example is metered by multiplying the prior odds of the concept description by the LS weight of each matching piece and the LN weight of each unmatched one:

    odds(E \mid features) = odds(E) \times \prod_{matched} LS \times \prod_{unmatched} LN    (2)

The resulting matching score is the odds in favor of a positive example and reflects the degree of match between the concept description and the example. This holistic flavor of matching differs from many machine learning systems in which a single characterization completely influences concept prediction.

B. Modifying element weights

The weights associated with each of the concept description elements are easily adjusted by incrementally counting the number of different matches between an element and examples; these counts are used to compute estimates of the probabilities in Equation 1:

    LS = \frac{|matched \wedge example| \,/\, |examples|}{|matched \wedge \neg example| \,/\, |\neg examples|}, \qquad LN = \frac{|\neg matched \wedge example| \,/\, |examples|}{|\neg matched \wedge \neg example| \,/\, |\neg examples|}    (3)

LS ranges from zero to infinity and is interpreted in terms of odds.¹ A weight greater than one indicates an element that predicts positive examples; a weight less than one denotes negative predictiveness. LN has the same range but the opposite interpretation. For both, a weight of one indicates irrelevance.

¹To convert odds into probability, divide odds by one plus odds.
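The following sketch shows how the counts behind Equation 3 yield LS and LN weights and how Equation 2 combines them into a matching score. The class and function names are my own, and the small count offset is an assumption added to avoid division by zero, not something specified in the paper.

    class Element:
        """One dually-weighted concept description piece (Figure 1)."""
        def __init__(self):
            # 2x2 contingency counts between this element and examples.
            self.m_pos = self.m_neg = 0   # matched & (example / non-example)
            self.u_pos = self.u_neg = 0   # unmatched & (example / non-example)

        def update(self, matched, positive):
            """Incrementally count one processed instance."""
            if matched:
                if positive: self.m_pos += 1
                else:        self.m_neg += 1
            else:
                if positive: self.u_pos += 1
                else:        self.u_neg += 1

        def ls(self, eps=0.5):
            """Equation 3: p(matched | example) / p(matched | -example)."""
            pos = self.m_pos + self.u_pos
            neg = self.m_neg + self.u_neg
            return (((self.m_pos + eps) / (pos + 2 * eps)) /
                    ((self.m_neg + eps) / (neg + 2 * eps)))

        def ln(self, eps=0.5):
            """p(-matched | example) / p(-matched | -example)."""
            pos = self.m_pos + self.u_pos
            neg = self.m_neg + self.u_neg
            return (((self.u_pos + eps) / (pos + 2 * eps)) /
                    ((self.u_neg + eps) / (neg + 2 * eps)))

    def match_score(prior_odds, element_matches):
        """Equation 2: prior odds times the LS weight of each matching
        element and the LN weight of each unmatched one.
        element_matches: iterable of (Element, matched?) pairs."""
        odds = prior_odds
        for element, matched in element_matches:
            odds *= element.ls() if matched else element.ln()
        return odds  # convert to probability with odds / (1 + odds)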
Keeping counts of matchings between elements and examples also allows calculating the prior expectation for an example: odds(example) = |examples| / |¬examples|.

By adjusting the weights associated with each of the descriptive elements, STAGGER is behaving as a single layer connectionist model. Without hidden units, these models suffer from the same representational limitations that STAGGER does without its Boolean learning method. Both are unable to assign weights to a combination of values, and this severely limits the discoverable concepts to only linearly-separable ones. From a representational point of view, the weight method can only form concepts in terms of existing description elements.

C. Forming new Boolean combinations

STAGGER selectively adds new elements by beam-searching through the space of all possible Booleans. Instead of including an element for all possible Boolean combinations of the attribute-values, this method begins with the relatively strong bias of only single attribute-value elements; the initial search frontier is the set of single attribute-value pairs. The Boolean method uses three search operators (specialization, generalization, and inversion) to add new elements to the search frontier. The search is limited by tentatively proposing a new element only when STAGGER makes an expectation error. This cautiously extends the descriptive power of the weight adjusting process while retaining the properties of a limited representation. For instance, when a non-example is expected to be a positive example, STAGGER is behaving too inclusively, and thus a more specific element may be needed. So, STAGGER expands the search frontier by tentatively adding a new AND formed from two elements which are necessary for the concept. The selection of component elements is based on two observations: at least one necessary element is unmatched in a non-example, and necessary elements typically have strong logical necessity weights. The other type of prediction error also triggers expansion. Specifically, a guess that a positive example is negative indicates a description that is overly specific. To correct for this underestimation, search is expanded to include a more general element; a new OR formed from two sufficient elements is tentatively added. Both predictive errors are opportunities to invert poor elements. Further details concerning the Boolean method are documented in (Schlimmer & Granger, 1986).
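The error-driven expansion can be sketched as follows. This is a simplification, assuming elements expose the ls() and ln() weights from the previous sketch; pairing only the two strongest candidates stands in for STAGGER's fuller beam search.

    def propose_element(elements, false_positive):
        """Propose one tentative Boolean element after a prediction error.

        false_positive: True when a non-example was expected to be
        positive (specialize); False when a positive example was
        expected to be negative (generalize)."""
        if false_positive:
            # Too inclusive: strong necessity shows up as an LN weight
            # well below one, so pair the two lowest-LN elements.
            a, b = sorted(elements, key=lambda e: e.ln())[:2]
            return ('AND', a, b)
        # Too exclusive: strong sufficiency is an LS weight well above
        # one, so pair the two highest-LS elements.
        a, b = sorted(elements, key=lambda e: e.ls(), reverse=True)[:2]
        return ('OR', a, b)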
Though the space of possible Booleans is large, it does not include states for numerically valued attributes. Therefore, STAGGER has a third learning method which extends this space by adding discrete values for real value ranges.

D. Partitioning real-valued attributes

In order to carve real-valued attributes into a set of discrete intervals, STAGGER retains a simple statistic for a number of potential interval end-points. These end-points are taken from processed examples and in turn break up an attribute's range into discrete values. Each new example supplies a value to update these statistics; for each potential end-point, a two by two record is kept of the number of positive and negative examples with values less than and greater than this potential end-point. A measure applied to these numbers indicates useful divisions in the value range, and by interpreting the best of these divisions as the end-points of discrete values, STAGGER maps the real-valued attributes of subsequent instances into discrete values. The utility measure is similar to Equation 2, for it involves the prior odds of each class and a conditional probability ratio similar to LS and LN:

    U(\text{end-point}) = \sum_{i=1}^{|classes|} odds(class_i) \times \frac{p(class_i \mid < \text{e-p})}{p(class_i \mid > \text{e-p})}    (4)

These conditional probabilities may be computed from the number of positive and negative examples with values less than and greater than the measured end-point. Since the potential end-points are taken from actual examples, the method is independent of scale considerations and does not entail any assumptions about the range of values. Furthermore, because partitioning is driven by a statistic based on class information, the method is able to uncover effective partitionings even when the values are uniformly distributed across all classes, something a gap finding method (Lebowitz, 1985) is unable to do.

A straightforward strategy for fractioning the real-value range would be to choose the end-point with a maximal utility and thereby divide the range into two discrete values: greater and less. However, for concept learning tasks which require finer distinctions, it would be more effective to divide the range into a number of discrete values. So after applying a local smoothing function, STAGGER chooses the end-points that are locally maximal. These end-points represent pivotal values, for Equation 4 favors those that are predictive of concept identity. This approach has the advantage that it naturally selects an appropriate number of discrete values. Having formed end-points in the real-valued range, STAGGER may transform continuous attribute values in successive examples into their discrete counterparts. This mapping results in example descriptions that are consistent with the input requirements of both the weight and Boolean learning methods described above. Furthermore, it embodies a type of representational learning, because by partitioning values into ranges, the concept description language for the other learning methods is expanded. In effect, this method transforms successive examples into a palatable form for the weight and Boolean learning methods.
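A sketch of Equation 4 and the local-maxima selection follows. The two-class case is assumed, and the three-point moving average is a placeholder for whatever local smoothing function STAGGER actually applies; end_points and utilities are assumed to be parallel, sorted lists.

    def utility(pos_lt, neg_lt, pos_gt, neg_gt):
        """Equation 4 for the two-class case, computed from the two by
        two record kept for one candidate end-point: counts of positive
        and negative examples below (lt) and above (gt) it."""
        n_lt, n_gt = pos_lt + neg_lt, pos_gt + neg_gt
        pos, neg = pos_lt + pos_gt, neg_lt + neg_gt
        if 0 in (n_lt, n_gt, pos, neg):
            return 0.0
        eps = 1e-9  # guard against a one-sided class above the cut
        u = (pos / neg) * (pos_lt / n_lt) / max(pos_gt / n_gt, eps)
        u += (neg / pos) * (neg_lt / n_lt) / max(neg_gt / n_gt, eps)
        return u

    def locally_maximal(end_points, utilities):
        """After smoothing, keep end-points whose utility is a local
        maximum; these become the attribute's cut points."""
        smooth = [sum(utilities[max(0, i - 1):i + 2]) /
                  len(utilities[max(0, i - 1):i + 2])
                  for i in range(len(utilities))]
        return [end_points[i] for i in range(1, len(smooth) - 1)
                if smooth[i] > smooth[i - 1] and smooth[i] > smooth[i + 1]]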
E. Interactions between the learning methods

STAGGER's three learning methods cooperatively interact, as Figure 2 depicts. The Boolean learning method alters the representational base for the weight adjusting method; it restructures the input so the latter is able to capture the concept's description. The numeric learning method has this administrative role for both the weight and Boolean learning processes; it rewrites the real-valued attributes into a form suitable for induction by the latter methods.

[Figure 2: Interaction between STAGGER's three methods (numerical, Boolean, and weight learning).]

The dependent learning methods also exert influence on the representational processes which counsel them. The Boolean method draws its components from the pool of ranked elements maintained by the weight learner. The weighting method also provides a weak form of feedback for the numeric method: the similarity between the division evaluation function (Equation 4) and the matching function (Equation 2) implicitly ensures that the numeric attributes will be amenable to weighting and matching.

IV. Empirical Performance

The interaction of the three learning methods is perhaps best illustrated by examining their behavior on concept learning tasks. Consider STAGGER's acquisition of a pair of simple object concepts. Each object is describable in terms of its size (a real value between 0 and 20), its color (one of 3 discrete values), and its shape (3 discrete values). For the first concept, an object is a positive example if it is red and between 5 and 15 in size. Optimally, the size attribute should be divided into three ranges: size < 5, 5 < size < 15, and 15 < size. In each of 10 executions, STAGGER's numeric partitioning method discovers this three-way split, and its Boolean combination technique forms a conjunction combining the middle value of size and the color red. The weight adjusting method further gives this element more influence over matching than any other.

To illustrate the functioning of the partitioning method, consider the set of potential end-points for this task depicted in Figure 3. Note that the partitioning measure (Equation 4) clearly identifies the two local maxima near 5 and 15 that are used to partition the size attribute into three discrete values. The element 5.0 <= size < 15.0 & red is typical of the elements formed by the cooperative action of the three learning methods.

[Figure 3: End-point utilities U(end-point) over SIZE (0 to 20) for the concept 5.0 <= size < 15.0 & red.]

The complete, conjunctive element does not appear suddenly. Though the methods are not explicitly synchronized, they appear to operate in a staged manner, as Figure 2 indicates. First, the numerical partitioning method begins to search for a reasonable way to partition real ranges. At the same time, the weight adjusting method searches for appropriate element strengths. The weight of color=red is adjusted at this time, but combination with the size attribute must wait until the numerical method settles down. After processing about 50 examples, the tripartite division of the size attribute is stable, and the weight method is able to assign a strong LN weight to the middle value of the size attribute. After this, the Boolean method combines the size and color elements to form the element depicted above. Weight adjusting finishes the job by giving this element strong LS and LN values.
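Once the tripartite division is stable, rewriting each subsequent example's size into one of the three discrete values is mechanical. A minimal sketch, where the string labels are illustrative stand-ins for STAGGER's internal symbolic values:

    import bisect

    def discretize(value, cuts, attribute='size'):
        """Rewrite a real value as one of len(cuts)+1 discrete symbolic
        values, given sorted cut points, e.g. cuts=[5.0, 15.0]."""
        i = bisect.bisect_right(cuts, value)
        lo = '-inf' if i == 0 else str(cuts[i - 1])
        hi = '+inf' if i == len(cuts) else str(cuts[i])
        return '%s<=%s<%s' % (lo, attribute, hi)

    # discretize(12.0, [5.0, 15.0]) -> '5.0<=size<15.0'
    # discretize(2.0,  [5.0, 15.0]) -> '-inf<=size<5.0'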
Regrettably, these three learning methods are not sufficient for all concept learning tasks; there are some concepts for which the numerical method is unable to uncover an effective partitioning. For example, consider the concept of objects that have a size between 5 and 15 or are red, but not both. Figure 4 indicates that the numerical method is unable to identify a reasonable partitioning in this case. This limitation also arises if we consider the capabilities of the weight method: if a concept involves an exclusive-or of a real-value range, STAGGER is unable to discover it.

[Figure 4: End-point utilities U(end-point) over SIZE (0 to 20) for the concept 5.0 <= size < 15.0 ⊕ red.]

V. Conclusions and Future Work

This paper describes a three part approach to the task of learning a concept from examples. A connectionist style learning method modifies a simple concept description by changing the weights associated with descriptive elements. A second, symbolic learning method forms Boolean combinations of these descriptive elements and allows the first method to overcome a representational shortcoming. The third learning method divides real-valued attributes into a set of discrete ranges, so the other methods are able to construct descriptions of numerical concepts. Overall, the interaction between these three methods is a type of cooperative representational learning. The numeric method changes the bias for both the Boolean and weight methods. Similarly, the Boolean method forms new compound elements and thus increases the representational capabilities of the weight adjusting method.

One drawback illustrated in the previous section arises because cooperation between the methods is incomplete. Dynamic feedback is lacking for the numeric method. Though it attempts to form value partitions that are effective for the other methods, it does not use information about the progress of learning at those levels to alter its course of action. Consequently, the numerical method can only uncover partitions that can be utilized by the weight method operating without the other methods, and the weight learner alone is only able to describe linearly-separable concepts (which does not include exclusive-or). The Boolean method allows the weight learner to overcome this limitation by rewriting its representational language. If the Boolean method could similarly direct the numerical method, then the latter would also be able to move beyond linearly-separable concepts. This is an area for future study.

Acknowledgements

Thanks to Rick Granger and Michal Young, who provided much of the early foundations for this work; to Ross Quinlan for his assistance in formulating the numeric learning method; to Doug Fisher for insight on the interactions between the weight, Boolean, and numeric learning methods; and to the machine learning group at UCI for ready discussion and suggestions.

References

Bradshaw, G. L. (1985). Learning to recognize speech sounds: A theory and model. PhD thesis, Department of Psychology, Carnegie-Mellon University, Pittsburgh, PA.

Duda, R., Gaschnig, J., & Hart, P. (1979). Model design in the Prospector consultant system for mineral exploration. In D. Michie (Ed.), Expert systems in the micro electronic age. Edinburgh: Edinburgh University Press.

Lebowitz, M. (1985). Categorizing numeric information for generalization. Cognitive Science, 9, 285-308.

Michalski, R. S. (1983). A theory and methodology of inductive learning. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach. Los Altos, CA: Morgan Kaufmann.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.

Rendell, L. (1986). A general framework for induction and a study of selective induction. Machine Learning, 1, 177-226.
Schlimmer, J. C., & Granger, R. H., Jr. (1986). Incremental learning from noisy data. Machine Learning, 1, 317-354.

Utgoff, P. E., & Mitchell, T. M. (1982). Acquisition of appropriate bias for inductive concept learning. Proceedings of the National Conference on Artificial Intelligence (pp. 414-417). Pittsburgh, PA: Morgan Kaufmann.

Utgoff, P. E. (1986). Shift of bias for inductive concept learning. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (Vol. 2). Los Altos, CA: Morgan Kaufmann.