From: AAAI Technical Report WS-93-06. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Experimentation Guided by a Knowledge Graph

Jan M. Zytkow and Jieming Zhu
zytkow@wise.cs.twsu.edu, jxzhu@data.cs.twsu.edu
Department of Computer Science
Wichita State University, Wichita, KS 67260 U.S.A.

Introduction

Discoverers always seek the unknown. They examine the world around them, and ask: what are the boundaries that separate the known from the unknown? Then they cross the boundaries to explore the world beyond. Machine discoverers can use the same strategy.

We will discuss a knowledge representation mechanism that makes it easy to find the unknown. In this representation, the current state of knowledge can be examined at any given time to find all new directions for future discovery. We call this approach knowledge-driven goal generation. Each state of knowledge can be transcended in different directions, so the goal generator typically creates many goals and should be followed by a goal selector. In this paper we abstract from goal selection and consider only the nondeterministic mechanism of goal generation.

A big advantage of our knowledge representation lies in separating the knowledge from the method. Goals correspond to the limitations of knowledge. They can be generated by examining the current state of knowledge, while the selection of the method for each goal is a separate issue. New discoverers can use new methods and are not bound by the methods used by their predecessors.

We will present the interaction between our knowledge representation mechanism and the mechanism for goal generation. Those mechanisms, added to the FAHRENHEIT discovery system (Zytkow, Zhu & Hussam, 1990), made the system more flexible and improved the efficiency of exploration. Applications of FAHRENHEIT include, among others, automated experimentation and discovery in a chemistry laboratory (Zytkow et al., 1990; Zytkow, Zhu, & Zembowicz, 1992).
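The claim that goals can be read off the limitations of the current knowledge can be illustrated by a minimal sketch. It is not FAHRENHEIT's implementation: the `Regularity` class, the `generate_goals` function and the goal labels are hypothetical names chosen for illustration. The sketch assumes only the kinds of incompleteness discussed later in this paper (an undetermined lower or upper boundary, and a regularity not yet generalized to another variable). It performs goal generation only; choosing among the goals is left to a separate selector.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Regularity:
    """Hypothetical knowledge node: one regularity and what is known about its limits."""
    name: str
    variable: str                               # independent variable of the regularity
    lower_boundary: Optional[float] = None      # None means "not yet determined"
    upper_boundary: Optional[float] = None
    remaining_variables: List[str] = field(default_factory=list)


def generate_goals(knowledge: List[Regularity]) -> List[Tuple[str, str]]:
    """List every goal implied by an incompleteness in the current knowledge."""
    goals: List[Tuple[str, str]] = []
    for reg in knowledge:
        if reg.lower_boundary is None:
            goals.append(("find-lower-boundary", reg.name))
        if reg.upper_boundary is None:
            goals.append(("find-upper-boundary", reg.name))
        # Each variable not yet covered by the regularity yields a generalization goal.
        for var in reg.remaining_variables:
            goals.append((f"generalize-to-{var}", reg.name))
    return goals


r1 = Regularity("R1", "x1", remaining_variables=["x2"])
for goal in generate_goals([r1]):
    print(goal)
```

Because the goals are computed from the knowledge state alone, the method used to pursue each goal can be replaced without touching the generator, which is the separation of knowledge from method argued for above.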
Two other machine discoverers use experience selection mechanisms different from FAHRENHEIT's: LIVE (Shen 1993) and DIDO (Scott & Markovitch 1993). LIVE combines exploration with problem solving. The needs of the problem solver and failed predictions drive the mechanism which selects new experiments. LIVE is an optimistic generalizer and it can generate a universal theory from one experience. Because it induces universal theories, it sees no reason to search for boundaries until it encounters a failed prediction. FAHRENHEIT is more conservative and it literally expects limits to everything. Whenever it reaches any regularity R, its goal generator suggests determining the limitations of R. So FAHRENHEIT, in distinction to LIVE, actively seeks the boundaries and then makes experiments beyond the boundaries to get empirical data for new theories.

DIDO brings a probabilistic perspective to the exploration, absent both in LIVE and in FAHRENHEIT. It expresses its theories as a network of probabilistic rules. It refines its rules to make them as deterministic as possible, so it may eventually find both the deterministic regularities and their boundaries, but after a brief initial phase the combined scope of the rules covers the whole space, so there is no need to search for areas beyond the boundaries. Instead, DIDO searches for the areas of greatest uncertainty and tries to improve the rules in those areas.

Both LIVE and DIDO represent the scope of each rule by a conjunction of conditions on single attributes. In contrast, FAHRENHEIT can find sophisticated boundaries, which may be described as functions of many variables. Phase transitions in physics or chemistry are typically described by such boundaries. If many partial regularities are present, LIVE, DIDO and FAHRENHEIT can discover them gradually, continuing the search until they reach an empirically complete deterministic theory.
In contrast, BACON (Langley et al., 1987) was limited to situations in which a single multidimensional regularity holds for all datapoints, so the exploration goals were simple. Several other systems, including ABACUS (Falkenhainer & Michalski, 1986), IDS (Nordhausen & Langley, 1990), and KEPLER (Wu & Wang, 1989), do not conduct autonomous exploration and must be fed with data. Unlike FAHRENHEIT, they are unable to search for new unexplored areas.

In this paper we concentrate on the representation of a theory incrementally discovered by FAHRENHEIT. A theory is organized as a knowledge graph, which models the topology of laws and their boundaries in the space of variables controlled by FAHRENHEIT. We demonstrate how the graph analysis leads FAHRENHEIT to new discovery goals. New discoveries, made in response to those goals, lead to the growing graph of nascent knowledge.

Figure 1. FAHRENHEIT can discover multiple regularities and boundaries between them in a multi-dimensional data space. (a) Our walk-through example. (b) The intermediate state: some regularities and their boundaries have been discovered. (c) The final state of knowledge.

Figure 2(a). In a one-dimensional sequence of data (x1 is varied, x2 is constant), FAHRENHEIT discovers a regularity, the lower (lb) and upper (ub) boundary for the regularity, and two collections (seeds2 and seeds3) of data which do not fit the regularity, but from which the further search for a new regularity can start.

Empirical exploration of a numerical space

Consider N independent variables x1,...,xN and one (for simplicity) dependent variable y. Each variable is limited in scope to the range between a minimum and a maximum value. We concentrate on numerical parameters, because they are very important in natural sciences, and because we use the properties of topological closeness and topologically coherent regions in a numerical space. Experiments are the only way of obtaining information about E.
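The setting just described, bounded numerical variables that can be probed only through experiments, can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the `Variable` class, the `run_experiment` helper and the `toy_environment` function (a stand-in for the unknown E) are hypothetical and are not taken from FAHRENHEIT.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Variable:
    """An independent variable limited to the range [minimum, maximum]."""
    name: str
    minimum: float
    maximum: float

    def admits(self, value: float) -> bool:
        return self.minimum <= value <= self.maximum


def run_experiment(
    space: List[Variable],
    settings: Dict[str, float],
    environment: Callable[[Dict[str, float]], float],
) -> float:
    """Enforce one value per independent variable, then read the dependent variable y."""
    for var in space:
        if var.name not in settings:
            raise ValueError(f"no value set for {var.name}")
        if not var.admits(settings[var.name]):
            raise ValueError(f"{var.name}={settings[var.name]} is outside its range")
    return environment(settings)


def toy_environment(settings: Dict[str, float]) -> float:
    """Toy stand-in for E (an assumption): y = 2*x1 + x2."""
    return 2.0 * settings["x1"] + settings["x2"]


space = [Variable("x1", 0.0, 10.0), Variable("x2", 0.0, 5.0)]
y = run_experiment(space, {"x1": 3.0, "x2": 1.0}, toy_environment)
print(y)  # 7.0
```

The discoverer sees E only through calls such as `run_experiment`; the range checks model the minimum and maximum values that limit each variable.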
The experimenter can set the values of all independent variables and measure the values of the dependent variable. Each experiment consists in enforcing a value for each independent variable xi, i = 1,...,N, and in reading the value of y. The task of an autonomous empirical discoverer is to generate as complete and empirically adequate a theory of E as possible, including regularities for the dependent variable and boundary conditions for those regularities.

As an example, we consider a space of two independent variables x1 and x2, with regularities R1, R2, and R3 divided by boundaries B1, B2, and B3, depicted in Figure 1(a). Many similar diagrams can be found in natural sciences publications. If Figure 1(a) is interpreted as a phase diagram for water, R1, R2, and R3 are state equations for ice, water and steam, the boundaries indicate melting/freezing and evaporation/condensation, while β indicates the triple point of water, in which steam, water, and ice are in equilibrium. Knowledge of the exact location of such points is an important part of theories. We do not show the y dimension in our figures, because in this paper we are not concerned with the form of equations R1, R2, and R3. We are concerned with boundary equations B1, B2, and B3, and those are expressed in terms of independent variables.

Figure 2(b). After a regularity has been discovered, any one of the three goals, illustrated as circles, can be chosen in the next step.

Incremental construction of the graph of knowledge

In FAHRENHEIT, discovery of a complete theory consists of many simple steps, performed repeatedly, and falling into a few generic categories. Each task is triggered by a specific incompleteness in the graph of knowledge.

Figure 3. Exploring different values of x2, FAHRENHEIT discovers a sequence of one-dimensional regularities and their boundaries, and then generalizes them to the second dimension x2.

Regularity in one independent variable

When FAHRENHEIT knows nothing about the domain, it selects one independent dimension, say x1, while x2 and the other independent variables are kept constant. FAHRENHEIT typically conducts experiments in sequences. Based on the initial value of x1 and the increment for x1, it makes n values x1_i (i = 1,...,n) of x1 and reads the corresponding values of y, leading to the data set (x1_i, y_i), i = 1,...,n. Such a data set can be treated as a lookup table, so it is already a piece of knowledge. However, FAHRENHEIT looks for equations relating x1 and y, using the equation finder (EF) described by Zembowicz and Zytkow (1991).

After EF finds an equation, the knowledge graph consists of one node, indicated by the central node R1(x1) in Figure 2(a). At this time, three goals become the candidates for the next discovery step. They are shown in Figure 2(b). If the search for the upper or the lower boundary is selected, additional experiments follow, until the boundary is found. After two boundaries are detected, that is, after goal 2 and goal 3 in Figure 2(b) have been completed, the results form the graph depicted in the lower part of Figure 2(a). Some data generated in the process of boundary search do not fit the regularity. They become the seeds from which the search for regularities can continue (seeds2 and seeds3 in the graph). At this time, the graph in Figure 2(a) yields three goals:

1. generalize the regularity R1(x1) to x2;
2. find a regularity for x1 in the area of seeds2;
3. find a regularity for x1 in the area of seeds3.

Each of these goals can be selected. The actual history of the exploration depends on the preference in goal selection. Should goal 2 or 3 be picked, the exploration would be similar to the one described above.

Figure 4. Exponential complexity of the knowledge tree in BACON and FAHRENHEIT (BACON in thicker lines). p1, p2, ... are parameters in the regularities, such as slope and intercept in a linear equation. For simplicity, we assume two parameters per regularity. Horizontal links show relationships between a regularity and its boundaries. For simplicity, we show only the horizontal links at the level x1.

Recursive generalization of empirical regularities

Generalization to x2 uses the process described in Section 3.1 (Regularity in one independent variable), but applies it to x2 as the independent variable, while the dependent variables are all parameters in the regularity R1(x1) and the boundaries of R1(x1). The one-dimensional process is repeated for each new value of x2. Step after step, a number of one-dimensional regularities for x1 are discovered, along with the boundaries of those regularities and the seeds in the areas outside the discovered regularity, as depicted in Figure 3. When enough values of parameters in the equation R1(x1) are collected, R1(x1) can be generalized to R1(x1,x2) by application of EF to each parameter in R1(x1) (cf. BACON, Langley et al., 1987). After each equation for x2 is discovered, the new goals are to seek the upper boundary in x2 and the lower boundary in x2 for that equation. For each of these goals, the experimentation proceeds according to the experimental strategy of the boundary search. The results are depicted in Figure 4.

Representation of knowledge by a tree

Initially, the generalization of a graph of knowledge from x1 to x2 produces a tree-like structure, depicted in Figure 4. That tree resembles only remotely the topology of regularities and their boundaries illustrated in Figure 1(a). Many nodes in that tree represent the same line or the same point in Figure 1(a), so that, when Figure 4 is used to set new discovery goals, the same goal will be created many times. We will now discuss reduction of the tree to a graph, depicted in Figure 5(d).

Tree reduction to a graph: identification of nodes

Initially, the generalization produces a tree of nodes, as depicted in Figure 4, including the links at the x1 level, between parameters in the regularity R1(x1) and between regularity R1(x1) and its boundaries. We will consider in detail the example in Figure 5(a), which is both an expansion and a slight simplification of the situation shown in Figure 4.

Figure 5. When FAHRENHEIT generalizes the regularities and boundaries to x2, a tree of new regularities and boundaries is created, shown in (a). Many nodes at the x2 level are identified, creating intermediate graphs shown in (b) and (c). Thick lines in (b), (c) and (d) indicate links between regularities and their boundaries. Dotted lines indicate links between boundaries on regularities and boundaries of boundaries. Boundaries on boundaries have the physical interpretation of phase equilibria, such as the triple point of water. Thick and dotted lines in (c) correspond to those in (d), showing the isomorphism of both graphs. The asterisks (*) abbreviate the appropriate regularities.

Many nodes in Figure 5(a) correspond to the same object in Figure 1. For instance, the upper boundary of the regularity R2 is equal to the lower boundary of R1. The common boundary is represented by the node lb(R1)&ub(R2) in Figure 5(b), and as the node B1 in Figure 5(d). Another group of nodes that are identified is the upper boundary of the regularity on the upper boundary of R2, the upper boundary of the regularity on the lower boundary of R1, and the upper boundary of the regularity on the upper boundary of R1, all of them corresponding to the same point β in Figure 5(d). Still another pair of nodes that are identified is the lower boundary of the regularity on the upper boundary of R2, and the lower boundary of the regularity on the lower boundary of R1. They are put together to form the point α in Figure 5(d).

In special circumstances, new boundary nodes can be added between similar nodes. Consider two nodes that have the same value of x2, although they cannot be identified, because they have distinctly different values of x1 (the difference is bigger than the error with which they have been determined). For instance, such nodes are the lower boundary of the regularity on the lower boundary of R1, and the lower boundary of the regularity on the upper boundary of R1. They are marked respectively as α and γ in Figure 5(d). Because α and γ have different values of x1, they cannot be identified. However, they share the same value of x2 with the lower boundary on R1(x1,x2), so that lower boundary becomes a new edge in the graph, retaining its link to R1(x1,x2), and being linked to α and γ as the lower and upper bound on that edge. The results are illustrated in Figure 5(c). Incidentally, this new boundary coincides with the space boundary.

The new discovery tasks

New discovery goals follow from the knowledge graph depicted in Figure 5(d):

1. Generalize the graph to another dimension. Each instance of generalization corresponds to one of the remaining independent variables x3,...,xN. In our example, no variables are left, because x1 and x2 are the only variables, so this goal is not created.

2. The find-a-new-area operator looks for "seeds", which are attached to various boundaries in the form of collections of tuples of independent values. Each set of seeds can be used to start new experiments in search of a new regularity. For instance, if R1 and R2 have been discovered, as described in Figure 5(c), then the search for a new area returns a few points in the area of R3, which are a byproduct of the boundary search and from which R3 can be gradually discovered.

The exploration continues until all areas are covered by regularities. This corresponds to the state of knowledge depicted by the graph in Figure 1(c) for our walk-through example. The boundaries in Figure 1(c) represent the limitations on the explorer's manipulation skills and cannot be crossed until those skills are expanded.

Conclusions

In this paper we concentrated on a knowledge graph incrementally discovered by FAHRENHEIT and on demonstrating how the goals for further discovery are guided by that representation. Experimentation is intermingled with the theoretical analysis, and new results are automatically integrated into the graph. We presented a graph reduction schema which has two major purposes. First, it reduces the number of goals and makes each goal unique. Second, the knowledge graph models the topology of laws and their boundaries in the parameter space.

References

Falkenhainer, B.C., & Michalski, R.S. 1986. Integrating Quantitative and Qualitative Discovery: The ABACUS System. Machine Learning, 1, 367-401.
Langley, P.W., Simon, H.A., Bradshaw, G., & Zytkow, J.M. 1987. Scientific Discovery: An Account of the Creative Processes. Boston, MA: MIT Press.
Nordhausen, B., & Langley, P. 1990. An Integrated Approach to Empirical Discovery. In: J. Shrager & P. Langley (eds.) Computational Models of Scientific Discovery and Theory Formation, Morgan Kaufmann Publishers, San Mateo, CA, 97-128.
Scott, P.D., & Markovitch, S. 1993. Experience Selection and Problem Choice in an Exploratory Learning System. Machine Learning. To be published.
Shen, W.M. 1993. Discovery as Autonomous Learning from Environment. Machine Learning. To be published.
Wu, Y. and Wang, S. 1989. Discovering Knowledge from Observational Data. In: Piatetsky-Shapiro, G. (ed.)
Knowledge Discovery in Databases, IJCAI-89 Workshop Proceedings, Detroit, MI, 369-377.
Zembowicz, R. & Zytkow, J.M. 1991. Automated Discovery of Empirical Equations from Data. In: Ras, Z. & Zemankova, M. (eds.) Methodologies for Intelligent Systems, Springer-Verlag, 429-440.
Zytkow, J.M., Zhu, J. & Hussam, A. 1990. Automated Discovery in a Chemistry Laboratory. Proceedings of the AAAI-90, AAAI Press, 889-894.
Zytkow, J.M., Zhu, J., & Zembowicz, R. 1992. Operational Definition Refinement: a Discovery Process. Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI Press, 76-81.