From: AAAI Technical Report WS-94-03. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.

PolyAnalyst: A MACHINE DISCOVERY SYSTEM INFERRING FUNCTIONAL PROGRAMS

Mikhail V. Kiselev
Computer Patient Monitoring Laboratory at National Research Center of Surgery, per. Abrikosovski, 2, Moscow 119874, RUSSIA
e-mail: geography@glas.apc.org

ABSTRACT

The present paper describes the learning technique used in PolyAnalyst, a system of machine discovery and intelligent analysis of experimental/observational data created in the Computer Patient Monitoring Laboratory at the National Research Center of Surgery. PolyAnalyst is a multi-purpose system designed to solve the following classes of problems: 1) construction of a procedure realizing the mapping from a set of descriptions to a set of parameters given by pairs <description, parameter>; 2) search for interdependences between components of the description; 3) search for characteristic features of a given set of descriptions. Here a description means a single experimental/observational data record, and it is assumed that all records in the same set of observations have the same structure. The paper is devoted mainly to PolyAnalyst's application to type 1 problems. This type includes classification, empirical law inference, choice of the best decision from a fixed set of possible decisions, and other tasks. To solve a problem PolyAnalyst constructs and tests programs in a simple functional programming language whose inputs are the descriptions and whose outputs are the corresponding parameter values. While searching for the solution PolyAnalyst combines full search, heuristic search, and direct construction of the programs. Since the first version of PolyAnalyst was created it has solved a number of real problems from chemistry, medicine, geophysics, and agricultural science. One example is given in the paper: the prediction of the elasticity of polyethylene samples from their infra-red spectra.

KEYWORDS: machine discovery, automated knowledge acquisition.

1. Introduction

Since the 1960s, when machine learning emerged as a science, a great diversity of methods and systems has been proposed, varying in the form of training examples, the degree of autonomy, the use of existing formalized knowledge about the application domain, the form of learned information, and other features. However, the "density" of research is not the same over the whole range of machine learning tasks. In particular, this density is still low in the fields where problems of recognition of complex multi-factor interdependences in bodies of experimental/observational data have to be solved under conditions of sparse or absent model knowledge and minimum guidance from a human. This is especially true in cases where structural properties are important and the possibility to perform the desired experiments is limited. The main difficulty of these problems stems from the fact that, since the type of the sought dependence is not known a priori, the solution has to be sought in the space of procedures realizing various dependences. In this situation the possibility to construct the solution directly is limited, therefore search in very wide spaces has to be used. On the other hand, the search in the set of procedures can be combined with the use of more effective methods in some of its wide subsets, for example the sets of procedures equivalent to predicate calculus formulae or to rational functions.
The present research is devoted to the use of methods based on search in the space of procedures for the solution of real problems of empirical law inference and machine learning. Its aims include effective search strategies, principles of selection of the best search directions, and methods of direct synthesis of the desired procedures.

Recent years have been marked by a significant increase of research activity in the fields of automated knowledge acquisition, machine discovery, and automated inference of models. The present paper does not include a thorough literature review, which would require too much space. A good review of research before 1989 can be found in (MacDonald, Witten, 1989). There exists a number of systems which infer dependences between the variables constituting training examples, as a rule in the form of rational functions. The most famous ones are BACON (Langley, Zytkow, Bradshaw, Simon, 1986), ABACUS (Falkenhainer, Michalski, 1990), and FAHRENHEIT (Zytkow, Zhu, 1991). The description of another interesting machine discovery technique can be found in (Phan, Witten, 1990). There are various approaches which use extensions of the classic declarative formalisms to achieve more expressive power (see, for example, the extension of the predicate calculus in (Michalski, 1993)). A system which synthesizes functional programs in the process of search for the solution is described in (Shen, 1990). The problem of discovery of structural regularities in data is close to the grammar inference problem; see, for example, (Wolff, 1987). Finally, many works are devoted to theoretical aspects of the problem of recursive function inference (Jantke, 1987). Summarizing, it should be noted that all these works touch only different instances of the general problem of program synthesis as the universal approach to learning from examples and machine discovery. Besides that, only a few systems (for example, FAHRENHEIT) are reported to have real-world applications. The present paper is an attempt to show that search in a space of procedures built from domain-specific sets of primitives with the help of universal production schemes can serve as a basis for a universal data analysis system capable of solving real problems from various fields.

The paper is organized in the following way. Section 2 contains the description of the functional programming language in which PolyAnalyst formalizes constructed procedures. Section 3 explains how the search in the space of procedures is organised, and enumerates the methods used to avoid tautologies and families of equivalent procedures. In Section 4 the heuristics which determine the search priorities are described. Section 5 is devoted to the principles of direct construction of procedures by subdividing the initial task into a number of simpler subtasks. Since almost every constructed program contains constants, it in general represents a family of procedures; this fact makes the evaluation of a new program a non-trivial task. The evaluation procedure is described in Section 6. Finally, Section 7 is devoted to one of PolyAnalyst's applications.

2. The internal functional programming language.

The PolyAnalyst system synthesizes procedures. A procedure can be described as an object with some number of inputs and outputs (the number of outputs should be not less than 1). The simplest procedures are called primitives.
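As an illustration only, here is a minimal sketch of how such typed procedures and primitives might be represented; the paper does not give PolyAnalyst's internal data structures, so all names and fields below are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass(frozen=True)
class Procedure:
    """A procedure: a callable object with typed inputs and outputs.

    in_types / out_types name the data types of each input and output
    (they must be types declared for the task in the domain definition file)."""
    name: str
    in_types: Tuple[str, ...]
    out_types: Tuple[str, ...]
    fn: Callable[..., tuple]      # maps input values to a tuple of output values
    complexity: float = 1.0       # used later to order the search (Section 3)

# A primitive is the simplest kind of procedure, e.g. addition over "intensity" values.
ADD = Procedure("+", ("intensity", "intensity"), ("intensity",),
                lambda a, b: (a + b,))
print(ADD.fn(2, 3))   # (5,)
```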
Sets of primitives can be different for different tasks and are determined by the structure of the analysed data and the features of the application field, which are specified in the domain definition (DD) file (see Section 7). Every input and output of a procedure has a data type associated with it. It should be one of the data types defined for the given task in the DD file. New procedures are constructed from existing ones with the help of several (4 in the present version) production schemes (Kiselev, Kiseleva, 1990). These production schemes and other features of the language resemble John Backus' FP formalism (Backus, 1978). The production schemes are illustrated by the diagrams below, in which the solid rectangles denote procedures used to build the new procedure and the lines with arrows denote the data streams.

a. Functional composition. Input and output data types should match; the same is true for the other production schemes below.

b. Conditional execution. If the value of the COND input is true then execute the procedure, else copy some input values to outputs.

c. Unconditional loop. Put on some inputs of the procedure all possible combinations of values admissible for their data types, one after another, and sum all values returned by the procedure. Naturally, the data types associated with these inputs should be discrete and finite. This construction is suitable for realizing the predicate calculus quantifiers, integration, and other operations.

d. Conditional loop. Execute the procedure and test the value of its output marked as COND. If it is false then stop, else copy the values from outputs to inputs and begin a new iteration. The outputs of the new procedure are all the outputs of the old procedure except the COND output (which is always false at the end of execution).
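A minimal sketch of the four production schemes, here over plain Python callables rather than PolyAnalyst's typed procedures; type and algebra-type checking of inputs and outputs is omitted, so this is an illustration of the idea, not the system's actual code.

```python
from typing import Callable, Iterable

# a. Functional composition: feed g's output into f's (matching) input.
def compose(f: Callable, g: Callable) -> Callable:
    return lambda *args: f(g(*args))

# b. Conditional execution: run proc when COND is true, otherwise pass the value through.
def conditional(proc: Callable) -> Callable:
    return lambda cond, x: proc(x) if cond else x

# c. Unconditional loop: sum proc over every admissible value of a discrete, finite type.
def loop_sum(proc: Callable, domain: Iterable) -> Callable:
    return lambda *fixed: sum(proc(v, *fixed) for v in domain)

# d. Conditional loop: iterate proc, which returns (cond, state); stop when cond is false.
def conditional_loop(proc: Callable) -> Callable:
    def run(state):
        cond, state = proc(state)
        while cond:
            cond, state = proc(state)
        return state
    return run

# Example: an existence test over band numbers 0..3 built with the unconditional loop,
# in the spirit of realizing a predicate calculus quantifier.
exists_peak = loop_sum(lambda band, spectrum: int(spectrum[band] > 0), range(4))
print(exists_peak([0, 2, 0, 1]) > 0)   # True
```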
3. The search.

For every procedure built by the PolyAnalyst system a measure of its complexity is defined. The procedures are built in the order of increasing complexity. The heuristic search subsystem of PolyAnalyst can re-evaluate this complexity measure in accordance with certain heuristics (see the next section) and due to this mechanism can change the search order. Thus, this value, called the dynamical complexity, is the sum of the true complexity of the procedure and some heuristic corrections. The true complexity of a procedure is the sum of the complexities of the primitives entering it and the complexities of the production schemes which have been used for its creation.

It is obvious that to be operable the system should have mechanisms which recognize tautologies and equivalences of procedures and exclude tautological or equivalent procedures from the search. Further on, tautological procedures and procedures which are equivalent to ones found earlier will be denoted by the term 'reducible'. It should be noted that these mechanisms are especially important for PolyAnalyst because of the strong combinatorial growth of the number of synthesized procedures. During the creation of PolyAnalyst great efforts were made to provide it with an effective multilevel system sifting away the reducible procedures. The performed tests show that the existing mechanism recognizes and eliminates practically 100% of the reducible procedures in the first 7 generations. It includes the following components.

a. The production schemes are applied in an order which excludes the explicit "syntactic" equivalences. For example, if we have three procedures A, B, and C, all with one input and one output, then the procedure corresponding to their composition A(B(C(.))) could be built in two ways: as a composition of A(B(.)) and C(.), or as a composition of A(.) and B(C(.)). The system recognizes such situations and uses the second way only.

b. Every input and output of a procedure is marked not only by its data type but also by another attribute called the 'algebra type'. This attribute helps to avoid the "local" equivalences based on identities which involve several primitives. For example, the primitives "+" and "-" are bound by the identity (a+b)-c <-> b-(c-a). The program which realizes the "functional composition" production scheme contains rules which prohibit certain combinations of input and output algebra types. In the example mentioned above the system builds only the left-hand-side procedure, because the output algebra type of the "-" primitive and the input algebra type of the "-" primitive are incompatible.

c. Symmetry of the primitive inputs can be specified. The mechanism which builds new procedures determines the symmetries of a constructed procedure knowing the symmetries of its components. This allows elimination of the "symmetry based" equivalences such as a&b <-> b&a. Since the primitive "&" is declared symmetric, only one variant is built.

d. It is possible to declare that connecting two inputs of a certain primitive to the same variable (an output of some procedure or a loop variable) leads to a tautology. The system will not build procedures containing such fragments. For example, a&a <-> a.

e. It is a consequence of the general results of the theory of algorithms that in general the equivalence of two procedures cannot be proven by recursive application of a finite set of rules like those in a.-d. For this reason these exact "syntactic" methods are supplemented by inexact methods based on selective execution of the procedure for various combinations of the input values. These methods are called inexact in the sense that, while they are able to recognize all the reducible procedures, some non-reducible procedures may also be erroneously determined as reducible. However, in practice the risk of rejecting a "good" procedure can be reduced to reasonable values by increasing the number of performed random tests. One of these methods determines whether some input of a procedure does not influence its output: if the test executions show that the output value of a procedure does not depend on some of its inputs for all tried combinations of values on the other inputs, the procedure is declared tautological and discarded.

f. For every new procedure the returned values are calculated for a definite set of standard input value combinations. The returned values are used to calculate a characteristic value called the procedure's signature. If two procedures have equal signatures they may be equivalent. In this case a number of tests is carried out, and if in each test the values returned by both procedures are equal, the procedures are considered equivalent and the more complex one is discarded.
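A minimal sketch of the inexact mechanisms (e) and (f), treating procedures as plain Python callables; the sampling scheme, number of tests, and signature construction here are illustrative assumptions, not PolyAnalyst's actual settings.

```python
import random
from typing import Callable, Sequence

def signature(proc: Callable, standard_inputs: Sequence[tuple]) -> tuple:
    """Characteristic value of a procedure: its outputs on a fixed set of standard inputs."""
    return tuple(proc(*args) for args in standard_inputs)

def probably_equivalent(p: Callable, q: Callable, standard_inputs: Sequence[tuple],
                        sample_inputs: Callable[[], tuple], n_tests: int = 100) -> bool:
    """Mechanism (f): equal signatures trigger extra random tests;
    surviving all of them, the pair is treated as equivalent."""
    if signature(p, standard_inputs) != signature(q, standard_inputs):
        return False
    return all(p(*a) == q(*a) for a in (sample_inputs() for _ in range(n_tests)))

def has_irrelevant_input(proc: Callable, n_inputs: int,
                         sample_value: Callable[[], object], n_tests: int = 100) -> bool:
    """Mechanism (e): does some input never change the output over random probes?"""
    for i in range(n_inputs):
        changed = False
        for _ in range(n_tests):
            args = [sample_value() for _ in range(n_inputs)]
            alt = list(args)
            alt[i] = sample_value()
            if proc(*args) != proc(*alt):
                changed = True
                break
        if not changed:
            return True            # input i appears not to influence the output
    return False

# Example: a + b - b is flagged as having an irrelevant input (with high probability).
rnd = lambda: random.randint(-10, 10)
print(has_irrelevant_input(lambda a, b: a + b - b, 2, rnd))   # True
```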
4. Heuristics determining the search priorities.

The heuristic rules of dynamical complexity re-evaluation used in the PolyAnalyst system are divided into two categories. The heuristics from the first category re-evaluate the dynamical complexity of procedures on the basis of certain structural features. Their importance is determined by the fact that not all synthesized procedures can be a solution even potentially. For example, the procedure a&(b-c), constructed from the two primitives "&" and "-", cannot be a solution because it has no access to the analysed data. To become a candidate for a solution (such procedures are called complete procedures) it should be linked with procedures which have access to the data through the special data access primitives. Naturally, the percentage of incomplete procedures grows with the complexity of the built procedures, and if appropriate measures are not taken the system wastes more and more time constructing incomplete procedures which will never be completed. The heuristics therefore increase the dynamical complexity of incomplete procedures proportionally to the number of steps necessary for their completion.

The heuristics from the second category evaluate a procedure by applying it to the training examples. They can decrease as well as increase the dynamical complexity and can change the search priorities radically. Three of them are the most important:

- If a procedure has been found to be a solution of some task in the given application field (or of one of its subtasks; see the next section), decrease its dynamical complexity and, to a lesser degree, the complexity of its parents and children.

The last two heuristics use a parameter of a procedure called the entropy of returned values (ERV). To calculate this value, the procedure is applied to all training examples one after another; ERV is then the entropy of the distribution of the returned values (a code sketch is given at the end of this section).

- If a procedure has low ERV values for many combinations of the constants entering it, increase its dynamical complexity.

The third heuristic uses the ratio ERV/ERVrand, where ERVrand is the ERV calculated not for the real examples but for a generated set of random examples. The random examples have the same structure as the real examples but consist of random values distributed by the same law as the respective values in the real examples.

- If a procedure has a high ERV/ERVrand value, decrease its dynamical complexity.

It should be noted that the case when ERV/ERVrand is close to zero may also be interesting: it means that the built procedure expresses some specific property of the analysed data. Finding such characteristic properties can be the main aim in some tasks.

Additional opportunities to recognize valuable procedures appear when a measure of closeness of a found procedure to the solution can be defined. This case includes the problem of inference of empirical dependences in bodies of data belonging to "continuous" domains where real numbers play an important role. In this class of problems the standard error of prediction is a natural measure of closeness to the solution. The following search strategy has proven to be effective under such conditions. Two search processes run concurrently. The first one is the search process described in the previous section; it can roughly be called breadth-first search. The second one takes as a starting point the best (in terms of standard error) procedure found by the first process. For any procedure a set of transformations can be determined which do not increase the standard error; these are called not-worsening transformations (nWT). The second search process applies all possible nWTs to the best procedure. The derived procedure with the least standard error and the least number of additional parameters is considered the best obtained solution. When the first process finds a new best procedure, the second process takes it as a new starting point and performs a new nWT search. The operation continues until the desired standard error level is reached.
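As promised above, a minimal sketch of the ERV heuristic: the entropy of a procedure's returned values over the real training examples and over structure-preserving random examples. The example uses discrete-valued procedures; for continuous outputs some binning would be needed, which the paper does not detail, so this is an assumption-laden illustration.

```python
import math
import random
from collections import Counter
from typing import Callable, Sequence

def erv(proc: Callable, examples: Sequence) -> float:
    """Entropy of the distribution of values the procedure returns on the examples."""
    values = [proc(ex) for ex in examples]
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def erv_ratio(proc: Callable, examples: Sequence,
              randomize: Callable[[], object], n_rand: int = 1000) -> float:
    """ERV / ERVrand, where ERVrand is computed on random examples having the
    same structure and the same marginal distributions as the real ones."""
    random_examples = [randomize() for _ in range(n_rand)]
    denom = erv(proc, random_examples)
    return erv(proc, examples) / denom if denom > 0 else float("inf")

# Toy example with hypothetical discrete examples (tuples of small integers).
examples = [(random.randint(0, 3), random.randint(0, 3)) for _ in range(50)]
randomize = lambda: (random.randint(0, 3), random.randint(0, 3))
proc = lambda ex: ex[0] - ex[1]
print(erv(proc, examples), erv_ratio(proc, examples, randomize))
```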
5. Subtasks and direct construction of procedures.

Despite the heuristics and other mechanisms which help to cut off unpromising search branches, it is impossible to reach a very complex solution without using some effective direct methods. The following scheme has proven to be effective in many cases: search -> task subdividing -> attempts to solve the subtasks -> construction of the solution from the solutions of the subtasks. At present we use the following task subdividing strategies.

a. Suppose that the system has found a number of procedures each of which solves the task not over the whole set of training examples but over some of its subsets, so that these subsets together cover the full set of examples. Then the system initiates a new task: classification of the training examples into these subsets. If it finds the solution of this subtask, the solution of the initial task is composed from the subtask solutions in the obvious way.

b. Suppose that the system has found a procedure which returns the correct value when applied to each training example, but with different values of the constants entering it. In this case the system can try to find procedures realizing the mapping from the set of training examples to the desired values of the constants. After successful solution of these subtasks the final solution is constructed through the "functional composition" production scheme (a. in Section 2).

c. Suppose the same situation occurs as in b. The system can use it in another way: it forms a new set of training examples by adding to the existing examples the respective values of the procedure's constants. Assuming that all new examples belong to some class, it tries to solve the respective classification task. If the system finds its solution, the solution of the initial task is constructed through the "conditional loop" production scheme.

6. Evaluation procedure.

After a new procedure has been constructed it is compiled to efficient machine code. When PolyAnalyst evaluates a new procedure it calls it like a C language function. As was mentioned in the Introduction, since a constructed procedure can include constants it realizes a set of different dependences parametrized by the values of these constants. To determine whether this set contains the solution is therefore a non-trivial problem. To solve it, the present version of the PolyAnalyst system scans over all combinations of the discrete constants, and for each combination it performs hill-climbing in the space of the continuous constants.
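A minimal sketch of that constant-fitting step: an exhaustive scan over the discrete constants combined with a simple coordinate-wise hill climb over the continuous ones. The starting point, step sizes, and stopping rule are illustrative assumptions, not PolyAnalyst's actual settings.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

def fit_constants(error: Callable[[tuple, list], float],
                  discrete_domains: Sequence[Sequence],
                  n_continuous: int,
                  step: float = 1.0, min_step: float = 1e-3) -> Tuple[tuple, list, float]:
    """Scan all discrete constant combinations; hill-climb the continuous ones for each.

    error(discrete, continuous) returns the standard error of the parametrized
    procedure on the training examples (to be minimized)."""
    best = (None, None, float("inf"))
    for disc in product(*discrete_domains):
        cont = [0.0] * n_continuous               # assumed starting point
        err, h = error(disc, cont), step
        while h > min_step:
            improved = False
            for i in range(n_continuous):
                for delta in (+h, -h):
                    trial = list(cont)
                    trial[i] += delta
                    e = error(disc, trial)
                    if e < err:
                        cont, err, improved = trial, e, True
            if not improved:
                h /= 2.0                          # shrink the step when stuck
        if err < best[2]:
            best = (disc, cont, err)
    return best

# Example: fit y = a*x + b to toy data, with a from a discrete set and b continuous.
data = [(x, 3 * x + 1.5) for x in range(10)]
err = lambda d, c: sum((d[0] * x + c[0] - y) ** 2 for x, y in data) ** 0.5
print(fit_constants(err, [(1, 2, 3, 4)], 1))      # finds a = 3, b = 1.5
```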
7. Example of PolyAnalyst's application: dependence of the mechanical properties of polymers on their IR spectra.

The PolyAnalyst system has solved a number of problems in the chemistry of polymers. For example, it was given a number of IR spectra of polyethylene films and the values of the elasticity of the corresponding samples (see Fig. 1). The system should find a formula expressing the dependence of elasticity on some spectral parameters. This task was solved by the PolyAnalyst version which worked only with discrete data, therefore the spectral curves had to be transformed into sets of discrete parameters.

[Fig. 1. Examples of the IR spectra of the polyethylene films. The numbers are the values of elasticity of the respective samples.]

The problem of discretization of continuous data appears in many AI tasks; see an example of a discretization procedure in (Catlett, 1991). We used the following scheme (a code sketch of it is given below, after Fig. 3). The investigated spectral band was broken into NX intervals. For each interval the average intensity, the intensity dispersion, and the number of peaks were calculated. Then the range of intensity was also broken into NY intervals, and correspondingly the average intensities and the intensity dispersions were transformed into integers from 0 to NY-1. After this procedure every spectrum was represented by a three-sectional row of integers: the first section contained NX values of the discretized average intensities, the second section contained NX discretized intensity dispersions, and the third section contained NX values of the numbers of peaks in the respective spectral intervals. To determine the optimal values of NX and NY the following consideration was used: writing the first two sections requires 2*NX*log2(NY) bits. We calculated the information gain (IG) for each spectrum and for different NX, NY values and selected those NX, NY values which gave the maximum information gain per bit, IG/(2*NX*log2(NY)).

To start the search for a solution the system should know the structure of the examples, the set of primitives, and the set of data types. This information is supplied in the so-called domain definition (DD) file. The DD file for our case is shown in Fig. 2. The section DATA_TYPES contains the names of the data types and their ranges; in our case NX = 4, NY = 11. Here "band" is the spectral interval number. In the section ORDER the properties of the data types are specified. This information is necessary to generate a correct set of primitives: for example, for the "intensity" data type the whole set of arithmetical operations is generated, while for the "band" data type only the operators "next", "previous", "equal to", and "less than" are defined. The section QUANT allows loop variables of the types "band" and "intensity". Finally, the ASS_INSTRUCTIONS section fixes the three-sectional structure of the examples; it also contains the verbal definition of the meaning of the sections. The names of the data types and of the example sections are used when the obtained dependence is translated by PolyAnalyst into the "pseudo-English" verbal form (see below).

    .DATA_TYPES
    N=0-10
    band=0-3
    intensity=0-10
    .ORDER
    band=STRAIGHT
    intensity=ADDITIVE
    .QUANT
    band,intensity
    .ASS_INSTRUCTIONS
    band->intensity, of average level in $0
    band->L, there is high dispersion in $0
    band->N, of peaks in $0
    .END

Fig. 2. The DD file for the "IR spectra" domain.

It should be added that after the discretization the dispersion values became 0s and 1s only, so that they are treated as boolean (type L).

To solve this task (represented by the spectra of 47 polyethylene samples with known elasticity) PolyAnalyst computed for several hours. The solution is shown in Fig. 3.

[Fig. 3. The "pseudo-verbal" form of the solution: procedure FB#381, operands 0 4 2 0; res = -79.8*out + 398, d: 81.870.]
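As promised above, a minimal sketch of the spectrum discretization scheme: the band is split into NX intervals, each summarized by its quantized mean intensity, a quantized dispersion, and a peak count. The peak criterion and the exact quantization rule are assumptions made here for illustration; the paper does not specify them.

```python
from typing import List, Sequence

def discretize_spectrum(intensity: Sequence[float], nx: int = 4, ny: int = 11) -> List[int]:
    """Represent a spectrum as a three-sectional row of integers:
    NX quantized mean intensities, NX quantized dispersions, NX peak counts."""
    lo, hi = min(intensity), max(intensity)
    quantize = lambda v: max(0, min(int((v - lo) / (hi - lo + 1e-12) * ny), ny - 1))
    means, disps, peaks = [], [], []
    step = len(intensity) // nx
    for k in range(nx):
        chunk = intensity[k * step:(k + 1) * step] if k < nx - 1 else intensity[k * step:]
        m = sum(chunk) / len(chunk)
        d = sum((v - m) ** 2 for v in chunk) / len(chunk)
        means.append(quantize(m))
        disps.append(quantize(d ** 0.5))
        # a point counts as a peak if it is higher than both neighbours (assumed criterion)
        peaks.append(sum(1 for i in range(1, len(chunk) - 1)
                         if chunk[i] > chunk[i - 1] and chunk[i] > chunk[i + 1]))
    return means + disps + peaks

# Example with a toy "spectrum" of 40 intensity samples.
spectrum = [abs((i % 10) - 5) / 5.0 for i in range(40)]
print(discretize_spectrum(spectrum))
```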
At present the verbal representation produced by PolyAnalyst is far from perfect: it is a direct projection of the respective internal language structures. However, if all recursive substitutions are made, the following formula is obtained:

    Elasticity = -79.8 * out + 398   (± 81.870),

where 'out' is the value returned by the following procedure:

    out = (N of peaks in band #0) - (N of peaks in band #2),       if there is low dispersion in band #0;
    out = (N of peaks in band #0) - (N of peaks in band #2) * 4,   otherwise.

It can also be expressed as

    out = (N of peaks in band #0) - (N of peaks in band #2) * (1 + 3 * dispersion in band #0).

PolyAnalyst is able to determine whether the set of training examples is too small. One of the simplest methods is the following. It generates pairs <spectral data, elasticity> in which both components are taken from the real examples but are randomly mixed, and the same task is then solved for this randomized set. If the solution found for it gives a close value of the standard error, the solution of the main task cannot be considered reliable. This test is performed for several different randomized training sets.
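A minimal sketch of that reliability check as a permutation test. The abstract solve callable is a hypothetical stand-in for the whole PolyAnalyst search; the number of shuffles and the acceptance margin are assumptions.

```python
import random
from typing import Callable, Sequence, Tuple

def reliability_test(solve: Callable[[Sequence[Tuple]], float],
                     examples: Sequence[Tuple],
                     n_shuffles: int = 5, margin: float = 1.5) -> bool:
    """Return True if the solution found on the real data looks reliable.

    solve(pairs) runs the whole search on <description, target> pairs and
    returns the standard error of the best solution it finds."""
    real_error = solve(examples)
    descriptions = [d for d, _ in examples]
    targets = [t for _, t in examples]
    for _ in range(n_shuffles):
        shuffled = random.sample(targets, len(targets))
        random_error = solve(list(zip(descriptions, shuffled)))
        if random_error < real_error * margin:
            # a randomized set is fitted almost as well: the solution is suspect
            return False
    return True
```

In practice solve would be the full search described in Sections 3-5, so the cost of this check is several complete reruns of the search, one per randomized training set.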
Besides the tasks from the chemistry of polymers, PolyAnalyst has solved a number of real tasks from medicine, geophysics, and agricultural science.

8. Conclusion.

Our research shows that it is possible to organize an effective search in the space of procedures and to use it for real-world applications. The following factors are the most important for systems based on this approach:

- effective mechanisms for avoidance of trivial and equivalent procedures;
- reliable principles of selection of the search priorities;
- combined use of search and direct inference of programs;
- combination of the universal search technique in procedural spaces with special methods applicable to their subsets equivalent to some declarative formalisms.

The working mechanisms which realize the first three principles in the PolyAnalyst system were described in the paper. The mechanism realizing an effective search technique in the subspaces which are equivalent to the set of predicate calculus formulae and the set of rational functions is being developed now.

REFERENCES

BACKUS J. (1978) Can programming be liberated from the von Neumann style? Commun. ACM, v.21, pp 613-641.

CATLETT J. (1991) On changing continuous attributes into ordered discrete attributes, Proceedings of Machine Learning - EWSL-91, pp 164-178.

FALKENHAINER B.C., MICHALSKI R.S. (1990) Integrating Quantitative and Qualitative Discovery in the ABACUS System. In: Y. Kodratoff, R.S. Michalski (Eds.): Machine Learning: An Artificial Intelligence Approach (Volume III). San Mateo, CA: Kaufmann, pp 153-190.

JANTKE K.P. (Ed.) (1987) Analogical and Inductive Inference, Springer-Verlag.

KISELEV M.V., KISELEVA M.N. (1990) Procedural Approach to Learning from Examples, Proceedings of the Second Conference "Artificial Intelligence - 90" (USSR), vol. 3, pp 197-200. In Russian.

LANGLEY P., ZYTKOW J.M., BRADSHAW G.L., SIMON H.A. (1986) Rediscovering Chemistry with the BACON System, Machine Learning, v.2, pp 425-469.

MACDONALD B.A., WITTEN I.H. (1989) A Framework for Knowledge Acquisition through Techniques of Concept Learning, IEEE Trans. on Systems, Man, and Cybernetics, v.19, pp 499-512.

MICHALSKI R.S. (1993) A Theory and Methodology of Inductive Learning. In: B.G. Buchanan, D.C. Wilkins (Eds.): Readings in Knowledge Acquisition and Learning: Automating the Construction and Improvement of Expert Systems. San Mateo, CA: Kaufmann, pp 323-348.

PHAN T.H., WITTEN I.H. (1990) Accelerating Search in Function Induction, J. Expt. Theor. Artif. Intell., v.2, pp 131-150.

SHEN Wei-Min (1990) Functional Transformations in AI Discovery Systems, Artif. Intell., v.41, pp 257-272.

WOLFF J.G. (1987) Cognitive Development as Optimization. In: L. Bolc (Ed.), Computational Models of Learning, Springer, Berlin, pp 161-205.

ZYTKOW J.M., ZHU J. (1991) Application of Empirical Discovery in Knowledge Acquisition, Proceedings of Machine Learning - EWSL-91, pp 101-117.