Experimentation Guided by A Knowledge Graph

From: AAAI Technical Report WS-93-06. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.
Jan M. Zytkow and Jieming Zhu
zytkow@wise.cs.twsu.edu
jxzhu@data.cs.twsu.edu
Department of Computer Science
Wichita State University,
Wichita, KS 67260 U.S.A.
Introduction
Discoverers always seek the unknown. They examine the world around them and ask:
what are the boundaries that separate the known from the unknown? Then they cross
the boundaries to explore the world beyond. Machine discoverers can use the same
strategy. We will discuss a knowledge representation mechanism that makes it easy
to find the unknown. In this representation, the current state of knowledge can be
examined at any given time to find all new directions for future discovery. We call
this approach knowledge-driven goal generation. Each state of knowledge can be
transcended in different directions, so the goal generator typically creates many
goals and should be followed by a goal selector. In this paper we abstract from
goal selection and consider only the nondeterministic mechanism of goal generation.
A big advantage of our knowledge representation lies in separating the knowledge
from the method. Goals correspond to the limitations of knowledge. They can be
generated by examining the current state of knowledge, while the selection of the
method for each goal is a separate issue. New discoverers can use new methods and
are not bound by methods used by their predecessors. We will present the
interaction between our knowledge representation mechanism and the mechanism for
goal generation. Those mechanisms, added to the FAHRENHEIT discovery system
(Zytkow, Zhu & Hussam, 1990), made the system more flexible and improved the
efficiency of exploration. Applications of FAHRENHEIT include, among others,
automated experimentation and discovery in a chemistry laboratory (Zytkow et al.,
1990; Zytkow, Zhu, & Zembowicz, 1992).
Two other machine discoverers use experience selection mechanisms different from
FAHRENHEIT: LIVE (Shen 1993) and DIDO (Scott & Markovitch 1993). LIVE combines
exploration with problem solving. The needs of the problem solver and failed
predictions drive the mechanism which selects new experiments. LIVE is an
optimistic generalizer and it can generate a universal theory from one experience.
Because it induces universal theories, it sees no reason to search for boundaries
before it encounters a failed prediction. FAHRENHEIT is more conservative and it
literally expects limits to everything. Whenever it reaches any regularity R, its
goal generator suggests determining the limitations of R. So FAHRENHEIT, in
distinction to LIVE, actively seeks the boundaries and then makes experiments
beyond the boundaries to get empirical data for new theories. DIDO brings the
probabilistic perspective to the exploration, absent both in LIVE and in
FAHRENHEIT.
It expresses its theories as a network of probabilistic rules. It refines its rules
to make them as deterministic as possible, so it may eventually find both the
deterministic regularities and their boundaries, but after a brief initial phase,
the combined scope of the rules covers the whole space, so there is no need to
search for areas beyond the boundaries. Instead, DIDO searches for the areas of
greatest uncertainty and tries to improve rules in those areas.
Both LIVE and DIDO represent the scope of each rule by a conjunction of conditions
on single attributes. In contrast, FAHRENHEIT can find sophisticated boundaries,
which may be described as functions of many variables. Phase transitions in physics
or chemistry are typically described by such boundaries.
If many partial regularities are present, LIVE, DIDO and FAHRENHEIT can discover
them gradually, continuing the search until they reach an empirically complete
deterministic theory. In contrast, BACON (Langley et al., 1987) was limited to
situations in which a single multidimensional regularity holds for all datapoints,
so the exploration goals were simple. Several other systems, including ABACUS
(Falkenhainer & Michalski, 1986), IDS (Nordhausen & Langley, 1990), and KEPLER
(Wu & Wang, 1989) do not conduct autonomous exploration and must be fed with data.
Unlike FAHRENHEIT, they are unable to search for new unexplored areas.
In this paper we concentrate on the representation of a theory incrementally
discovered by FAHRENHEIT. A theory is organized as a knowledge graph, which models
the topology of laws and their boundaries in the space of variables controlled by
FAHRENHEIT. We demonstrate how the graph analysis leads FAHRENHEIT to new discovery
goals. New discoveries, made in response to those goals, lead to the growing graph
of nascent knowledge.

Figure 1. FAHRENHEIT can discover multiple regularities and boundaries between them
in a multi-dimensional data space. (a) Our walk-through example. (b) The
intermediate state: some regularities and their boundaries have been discovered.
(c) The final state of knowledge.

Figure 2(a). In a one-dimensional sequence of data (x1 is varied, x2 is constant),
FAHRENHEIT discovers a regularity, the lower (lb) and upper (ub) boundary for the
regularity, and two collections (seeds2 and seeds3) of data which do not fit the
regularity, but from which the further search for a new regularity can start.

Empirical exploration of a numerical space
Consider N independent variables x1, ..., xN and one (for simplicity) dependent
variable y. Each variable is limited in scope between a minimum and a maximum
value. We concentrate on numerical parameters, because they are very important in
natural sciences, and because we use the properties of topological closeness and
topologically coherent regions in a numerical space.
Experiments are the only way of obtaining information about E. The experimenter can
set the values of all independent variables and measure the values of the dependent
variable. Each experiment consists in enforcing a value for each independent
variable xi, i = 1, ..., N, and in reading the value of y. The task of an
autonomous empirical discoverer is to generate as complete and empirically adequate
a theory of E as possible, including regularities for the dependent variable and
boundary conditions for those regularities.
As an example, we consider a space of two independent variables x1 and x2, with
regularities R1, R2, and R3 divided by boundaries B1, B2, and B3, depicted in
Figure 1(a). Many similar diagrams can be found in natural sciences publications.
If Figure 1(a) is interpreted as a phase diagram for water, R1, R2, and R3 are
state equations for ice, water and steam, the boundaries indicate melting/freezing
and evaporation/condensation, while β indicates the triple point of water in which
steam, water, and ice are in equilibrium. Knowledge of the exact location of such
points is an important part of theories. We do not show the y dimension in our
figures, because in this paper we are not concerned with the form of equations R1,
R2, and R3. We are concerned with boundary equations B1, B2, and B3, and those are
expressed in terms of independent variables.

Figure 2(b). After a regularity has been discovered, in the next step any one of
the three goals, illustrated as circles, can be chosen.

Incremental construction of the graph of knowledge
In FAHRENHEIT, discovery of a complete theory consists of many simple steps,
performed repeatedly, and falling into a few generic categories. Each task is
triggered by a specific incompleteness in the graph of knowledge.

Figure 3. Exploring different values of x2, FAHRENHEIT discovers a sequence of
one-dimensional regularities and their boundaries, and then generalizes them to the
second dimension x2.

Regularity in one independent variable
When FAHRENHEIT knows nothing about the domain, it selects one independent
dimension, say x1, while x2 and other independent variables are kept constant.
FAHRENHEIT typically conducts experiments in sequences. Based on the initial value
of x1 and the increment for x1, it makes n values x1_i (i = 1, ..., n) of x1 and
reads the corresponding values of y, leading to the data set (x1_i, y_i),
i = 1, ..., n. Such a data set can be treated as a lookup table, so it is already a
piece of knowledge. However, FAHRENHEIT looks for equations relating x1 and y,
using the equation finder (EF) described by Zembowicz and Zytkow (1991).
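
The experiment sequence described above can be illustrated with a minimal Python sketch. It is not FAHRENHEIT's code: the function names, the invented `hypothetical_lab` measurement, and the use of a plain least-squares line in place of the equation finder (EF) are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Table = List[Tuple[float, float]]

def run_sequence(experiment: Callable[[float, float], float],
                 x1_start: float, step: float, n: int, x2: float) -> Table:
    """Vary x1 over n equally spaced values while x2 is held constant."""
    table = []
    for i in range(n):
        x1 = x1_start + i * step
        table.append((x1, experiment(x1, x2)))  # one experiment per row
    return table

def fit_line(table: Table) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept (a stand-in for EF)."""
    n = len(table)
    mean_x = sum(x for x, _ in table) / n
    mean_y = sum(y for _, y in table) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in table)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in table)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def hypothetical_lab(x1: float, x2: float) -> float:
    """Invented stand-in for a real measurement; not taken from the paper."""
    return 2.0 * x1 + x2

# The lookup table is already a piece of knowledge; the fit compresses it.
table = run_sequence(hypothetical_lab, x1_start=0.0, step=0.5, n=5, x2=1.0)
slope, intercept = fit_line(table)
```

The table plays the role of the data set (x1_i, y_i), and the fitted pair (slope, intercept) plays the role of the parameters of the first regularity R1(x1).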
After EF finds an equation, the knowledge graph consists of one node, indicated by
the central node R1(x1) in Figure 2(a). At this time, three goals become the
candidates for the next discovery step. They are shown in Figure 2(b). If the
search for the upper or the lower boundary is selected, additional experiments
follow, until the boundary is found.
After two boundaries are detected, that is, after goal 2 and goal 3 in Figure 2(b)
have been completed, the results form the graph depicted in the lower part of
Figure 2(a). Some data generated in the process of boundary search do not fit the
regularity. They become the seeds from which the search for regularities can
continue (seeds2 and seeds3 in the graph).
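
A rough sketch of how a boundary search can produce both a boundary and a set of seeds is given below. The paper does not spell out FAHRENHEIT's boundary search strategy, so the fixed-step walk, the tolerance test, and all names here are assumptions chosen for brevity.

```python
from typing import Callable, List, Tuple

def search_upper_boundary(experiment: Callable[[float], float],
                          law: Callable[[float], float],
                          x_start: float, step: float, x_max: float,
                          tol: float, n_seeds: int = 3
                          ) -> Tuple[float, List[Tuple[float, float]]]:
    """Walk upward from x_start in fixed steps.

    Returns the last x at which the law still fits, plus up to n_seeds
    data points collected after the first misfit.
    """
    boundary = x_start
    seeds: List[Tuple[float, float]] = []
    x = x_start + step
    while x <= x_max and len(seeds) < n_seeds:
        y = experiment(x)
        if not seeds and abs(y - law(x)) <= tol:
            boundary = x            # the regularity still holds here
        else:
            seeds.append((x, y))    # data beyond the boundary become seeds
        x += step
    return boundary, seeds

def hypothetical_world(x: float) -> float:
    """Invented two-regime behaviour with a break at x = 3."""
    return 2.0 * x if x <= 3.0 else 10.0 - x

boundary, seeds = search_upper_boundary(
    hypothetical_world, law=lambda x: 2.0 * x,
    x_start=1.0, step=0.5, x_max=6.0, tol=0.01)
```

In this sketch the boundary is located only to within one step; a real search would presumably refine it. The seeds are exactly the by-product data that do not fit the regularity.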
At this time, the graph in Figure 2(a) yields three goals:
1. generalize the regularity R1(x1) to x2;
2. find regularity for x1 in the area of seeds2;

Figure 4. Exponential complexity of the knowledge tree in BACON and FAHRENHEIT
(BACON in thicker lines). p1, p2, ... are parameters in the regularities, such as
slope and intercept in a linear equation. For simplicity, we assume two parameters
per regularity. Horizontal links show relationships between a regularity and its
boundaries. For simplicity, we show only the horizontal links at the level x1.

Recursive generalization of empirical regularities
Generalization to x2 uses the process described in Section 3.1, but applies it to
x2 as the independent variable, while the dependent variables are all parameters in
the regularity R1(x1) and the boundaries of R1(x1). Such a process is repeated for
each new value of x2. Step after step, a number of one-dimensional regularities for
x1 are discovered, together with the boundaries of those regularities and the seeds
in the areas outside the discovered regularities, as depicted in Figure 3.
When enough values of parameters in the equation R1(x1) are collected, R1(x1) can
be generalized to R1(x1, x2) by application of EF to each parameter in R1(x1)
(cf. BACON, Langley et al., 1987). After each equation for x2 is discovered, the
new goals are to seek the upper boundary in x2 and the lower boundary in x2 for
that equation. For each of these goals, the experimentation proceeds according to
the experimental strategy of the boundary search. The results are depicted in
Figure 4.
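
The parameter-by-parameter generalization can be sketched as follows. This is an illustrative assumption, not the system's implementation: a least-squares line stands in for EF at both levels, boundaries are left out, and `hypothetical_lab` is an invented measurement.

```python
from typing import Callable, List, Sequence, Tuple

Line = Tuple[float, float]  # (slope, intercept)

def fit_line(points: List[Tuple[float, float]]) -> Line:
    """Least-squares fit of y = slope * x + intercept (a stand-in for EF)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def generalize(experiment: Callable[[float, float], float],
               x1_values: Sequence[float],
               x2_values: Sequence[float]) -> Tuple[Line, Line]:
    """Fit y against x1 at each x2, then fit each parameter against x2."""
    slopes, intercepts = [], []
    for x2 in x2_values:
        slope, intercept = fit_line([(x1, experiment(x1, x2)) for x1 in x1_values])
        slopes.append((x2, slope))          # parameter p1 as data for level x2
        intercepts.append((x2, intercept))  # parameter p2 as data for level x2
    return fit_line(slopes), fit_line(intercepts)

def hypothetical_lab(x1: float, x2: float) -> float:
    """Invented law y = (1 + x2) * x1 + 3 * x2; not taken from the paper."""
    return (1.0 + x2) * x1 + 3.0 * x2

slope_law, intercept_law = generalize(
    hypothetical_lab, x1_values=[0.0, 1.0, 2.0, 3.0], x2_values=[0.0, 1.0, 2.0, 3.0])
```

Here `slope_law` describes how the slope of R1(x1) depends on x2 and `intercept_law` does the same for the intercept; together they amount to a two-dimensional regularity R1(x1, x2).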
Representation of knowledge by a tree
Initially, the generalization of a graph of knowledge from x1 to x2 produces a
tree-like structure, depicted in Figure 4. That tree resembles only remotely the
topology of regularities and their boundaries, illustrated in Figure 1(a). Many
nodes in that tree represent the same line or the same point in Figure 1(a), so
that, when Figure 4 is used to set new discovery goals, the same goal will be
created many times. We will now discuss reduction of the tree to a graph, depicted
in Figure 5d.
Tree reduction to a graph: identification of nodes
Initially, the generalization produces a tree of nodes, as depicted in Figure 4,
including the links at the x1 level, between parameters in the regularity R1(x1)
and between regularity R1(x1) and its boundaries. We will
3. find regularity for x1 in the area of seeds3.
Each of these goals can be selected. The actual history of the exploration depends
on the preference in goal selection. Should goal 2 or goal 3 be picked, the
exploration would be similar to the one described above.

Figure 5. When FAHRENHEIT generalizes the regularities and boundaries to x2, a tree
of new regularities and boundaries is created, shown in (a). Many nodes at the x2
level are identified, creating intermediate graphs shown in (b) and (c). Thick
lines in (b), (c), and (d) indicate links between regularities and their
boundaries. Dotted lines indicate links between boundaries on regularities and
boundaries of boundaries. Boundaries on boundaries have the physical interpretation
of phase equilibria, such as the triple point of water. Thick and dotted lines in
(c) correspond to those in (d), showing the isomorphism of both graphs. The
asterisks (*) abbreviate the appropriate regularities. A label in the figure reads:
new edge, lower limit on R1(x1, x2).

walk-through example. The boundaries in Figure l(c)
represent the limitations on the explorer’s manipulation skills and cannot be crossed until those skills are
expanded.
consider in detail the example in Figure 5a, which is both an expansion and a
slight simplification of the situation shown in Figure 4. Many nodes in Figure 5a
correspond to the same object in Figure 1. For instance, the upper boundary of the
regularity R2 is equal to the lower boundary of R1. The common boundary is
represented by the node lb(R1)&ub(R2) in Figure 5b, and as the node B1 in
Figure 5d.
Another group of nodes that are identified is the upper boundary of the regularity
on the upper boundary of R2, the upper boundary of the regularity on the lower
boundary of R1, and the upper boundary of the regularity on the upper boundary of
R1, all of them corresponding to the same point β in Figure 5d.
Still another pair of nodes that are identified is the lower boundary of the
regularity on the upper boundary of R2, and the lower boundary of the regularity on
the lower boundary of R1. They are put together to form the point α in Figure 5d.
In special circumstances, new boundary nodes can be added between similar nodes.
Consider two nodes that have the same value of x2, although they cannot be
identified, because they have distinctly different values of x1 (the difference is
bigger than the error with which they have been determined). For instance, such
nodes are the lower boundary of the regularity on the lower boundary of R1, and the
lower boundary of the regularity on the upper boundary of R1. They are marked
respectively as α and γ in Figure 5d. Because α and γ have different values of x1,
they cannot be identified. However, they share the same value of x2 with the lower
boundary on R1(x1, x2), so that lower boundary becomes a new edge in the graph,
retaining its link to R1(x1, x2), and being linked to α and γ as the lower and
upper bound on that edge. The results are illustrated in Figure 5c. Incidentally,
this new boundary coincides with the space boundary.
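
The identification of nodes can be pictured with a small sketch that merges tree nodes whose coordinates agree within the measurement error. The `Node` class, the greedy merging rule, the coordinates, and the error value are all hypothetical; the paper does not give the actual procedure.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    labels: List[str]  # tree nodes that were identified with this graph node
    x1: float
    x2: float

def identify_nodes(nodes: List[Node], err: float) -> List[Node]:
    """Greedily merge nodes whose x1 and x2 both agree within err."""
    merged: List[Node] = []
    for node in nodes:
        for m in merged:
            if abs(m.x1 - node.x1) <= err and abs(m.x2 - node.x2) <= err:
                m.labels.extend(node.labels)   # same point: identify them
                break
        else:
            merged.append(Node(list(node.labels), node.x1, node.x2))
    return merged

def shares_x2_only(a: Node, b: Node, err: float) -> bool:
    """True when two nodes agree in x2 but are distinctly different in x1."""
    return abs(a.x2 - b.x2) <= err and abs(a.x1 - b.x1) > err

# Invented coordinates: three tree nodes near one point, two others apart.
tree_nodes = [
    Node(["ub(reg on ub(R2))"], 5.00, 4.00),
    Node(["ub(reg on lb(R1))"], 5.02, 3.99),
    Node(["ub(reg on ub(R1))"], 4.98, 4.01),
    Node(["lb(reg on lb(R1))"], 1.00, 0.00),
    Node(["lb(reg on ub(R1))"], 9.00, 0.00),
]
graph_nodes = identify_nodes(tree_nodes, err=0.05)
```

With these made-up values the first three nodes collapse into one point (the role played by β), while the last two stay separate but share x2, which is the situation in which a new edge is added between them.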
Conclusions
In this paper we concentrated on a knowledge graph incrementally discovered by
FAHRENHEIT and on demonstrating how the goals for further discovery are guided by
that representation. Experimentation is intermingled with the theoretical analysis,
and new results are automatically integrated into the graph. We presented a graph
reduction schema which has two major purposes. First, it reduces the number of
goals and makes each goal unique. Second, the knowledge graph models the topology
of laws and their boundaries in the parameter space.
References
Falkenhainer, B.C., & Michalski, R.S. 1986. Integrating Quantitative and
Qualitative Discovery: The ABACUS System. Machine Learning, 1, 367-401.
Langley, P.W., Simon, H.A., Bradshaw, G., & Zytkow, J.M. 1987. Scientific
Discovery: An Account of the Creative Processes. Boston, MA: MIT Press.
Nordhausen, B., & Langley, P. 1990. An Integrated Approach to Empirical Discovery.
In: J. Shrager & P. Langley (eds.) Computational Models of Scientific Discovery and
Theory Formation, Morgan Kaufmann Publishers, San Mateo, CA, 97-128.
Scott, P.D., & Markovitch, S. 1993. Experience Selection and Problem Choice in an
Exploratory Learning System. Machine Learning. To be published.
Shen, W.M. 1993. Discovery as Autonomous Learning from Environment. Machine
Learning. To be published.
The new discovery tasks
New discovery goals follow from the knowledge graph depicted in Figure 5d:
1. Generalize the graph to another dimension. Each instance of generalization
corresponds to one of the remaining independent variables x3, ..., xN. In our
example, no variables are left, because x1 and x2 are the only variables, so this
goal is not created.
2. The find-a-new-area operator looks for "seeds", which are attached to various
boundaries in the form of collections of tuples of independent values. Each set of
seeds can be used to start new experiments in search of a new regularity. For
instance, if R1 and R2 have been discovered, as described in Figure 5c, then the
search for a new area returns a few points in the area of R3, which are a byproduct
of the boundary search and from which R3 can be gradually discovered.
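
The two kinds of goals listed above can be generated mechanically from the state of the graph, as the following sketch suggests. The data layout (a set of covered variables and a dictionary of seed sets keyed by boundary) and the label `boundary_toward_R3` are assumptions made for the example, not the representation used in FAHRENHEIT.

```python
from typing import Dict, List, Sequence, Set, Tuple

Goal = Tuple[str, str]
Point = Tuple[float, float]

def generate_goals(all_variables: Sequence[str],
                   covered_variables: Set[str],
                   seed_sets: Dict[str, List[Point]]) -> List[Goal]:
    """List every goal implied by what the graph does not yet cover."""
    goals: List[Goal] = []
    for var in all_variables:
        if var not in covered_variables:
            goals.append(("generalize", var))       # goal type 1
    for boundary, seeds in seed_sets.items():
        if seeds:
            goals.append(("find-new-area", boundary))  # goal type 2
    return goals

# Hypothetical state after R1 and R2 are known in a two-variable space.
goals = generate_goals(
    all_variables=["x1", "x2"],
    covered_variables={"x1", "x2"},
    seed_sets={"B1": [], "boundary_toward_R3": [(6.0, 2.0), (6.5, 2.5)]},
)
```

Goal selection is deliberately left out, in line with the paper's separation of goal generation from goal selection.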
The exploration continues until all areas are covered by regularities. This
corresponds to the state of knowledge depicted by the graph in Figure 1(c) for our
Wu, Y. & Wang, S. 1989. Discovering Knowledge from Observational Data. In:
Piatetsky-Shapiro, G. (ed.) Knowledge Discovery in Databases, IJCAI-89 Workshop
Proceedings, Detroit, MI, 369-377.
Zembowicz, R. & Zytkow, J.M. 1991. Automated Discovery of Empirical Equations from
Data. In: Ras, Z. & Zemankova, M. (eds.) Methodologies for Intelligent Systems,
Springer-Verlag, 429-440.
Zytkow, J.M., Zhu, J. & Hussam, A. 1990. Automated Discovery in a Chemistry
Laboratory. Proceedings of the AAAI-90, AAAI Press, 889-894.
Zytkow, J.M., Zhu, J. & Zembowicz, R. 1992. Operational Definition Refinement: a
Discovery Process. Proceedings of the Tenth National Conference on Artificial
Intelligence, AAAI Press, 76-81.