From: AAAI Technical Report SS-02-05. Compilation copyright © 2002, AAAI (www.aaai.org). All rights reserved. AutomaticSynthesisof Statistical DataAnalysis Programs Position Statementm Bernd Fischer RIACS/NASA Ames Research Center fisch@email, arc. nasa. gov Statistical data analysis is a core activity in experimental sciences. It encompassesa wide variety of tasks, ranging from, e.g., a simple linear regression to fitting complex dynamicalmodels. Developingstatistical data analysis programs, however, is an arduous and error-prone process whichrequires profoundexpertise in different areas: statistics, numerics, software engineering, and of course the scientific application domain. Webelieve that statistical data analysis is a very promising domainfor the application programsynthesis, despite-or evenbecause--thesedifficulties. Statistics providesa unifying and concise domain-specific notation. Graphical models (Buntine 1994) provide a structuring mechanismwhich can be exploited during the synthesis process, e.g., to decomposea problem into independent subproblems. Statistical algorithms like EM(Dempster, Laird, &Rubin 1977) are often applicable to a wide range of problems;their generic formulations allow a "plug’n’play"-style algorithm combination. Recently developedsophisticated data structures as for examplekd-trees (PeUeg& Moore1999) offer orders-ofmagnitudespeed-up for certain problems but are rarely employed due to the increased programmingcomplexity they cause. Finally, data analysis is characterized by an iterative developmentstyle: an initial modelis hypothesized, implemented,evaluated on real data, and--if necessary---refined. However,each iteration typically involves substantial programmingefforts as prototyping is often not sufficient to work with real data sets and even small modelmodifications mayrequire radically dfferent algorithms. Programsynthesis encapsulatesmanyof the statistical, numerical,and software engineering aspects of each iteration and thus allows users to concentrate on their scientific application. Its fast turn-around times make model refinement and design-space exploration feasible. We are currently developing AUTOBAYES, a program synthesis system for the generation of data analysis programsfrom statistical models. A statistical modelspecifies the properties for each problemvariable (i.e., observationor parameter) and its dependenciesin the form of a probability distribution. Figure 1 shows the model for a random walk (i.e., a simple noisy dynamical process) in AUTOBAYES’s input notation. The last twolines are the core of the specification; the remaining lines just declare the modelvariables Copyright(~) 2002,American Associationfor Artificial Intelli- 73 model walk as ’RandomWalk’. const nat n_points. where 0 < n_points. double rate as ’drift rate / time slice’. double error as ~drift error / ~iee slice’. where 0 < error. data double x(O..n_points-1). x(I) " gauss(cond(I>O,x(I-1),O)+rate, maxpr(x[rate, error) ~or rate, error. Figure ]: AUTOBAYES specification of random walk and imposesomeadditional constraints on them. The distribution statement x(I) " gaues(cond(I>O,x(I-1),O)+rate, characterizes the observed data z: each observation zi is normal distributed (or Gaussian) with a gradually changing mean and constant but unknownerror. Its mean value is expected to depend on the previous observation zi-1 and an unknown drift rate. Overtime, the data points thus drift awayfrom their origin, which is here modeledas zero. The optimization statement max pr(xl~rate, error}) for Grate, error} characterizes the data analysis task: find the values for the unknownrate and error parameters which explain best the knowndata z, i.e., maximizethe data probability under the parameter values and modeldistributions. From such models, AUTOBAYES generates optimized and thoroughly documented C/C++ code which can be linked dynamically into the Matlab and Octave environments. AUTOBAYES follows a schema-based synthesis approach rather than the direct proof-as-programs-basedapproach. However,schemas can equivalently be understood as theorems of the domaintheory or as generic algorithms, and AUTOBAYES incorporates schemas with a distinctive "theoremflavor" as well as such with moreof an "algorithm flavor." The schemasare organizedinto four different hierarchical layers. The top-most layer contains the most theoremlike schemas,whicharc direct formalizationsof different decomposition theorems for graphical models. These schemas gence(wwwoA_~oorg). All rights reserved. contain almost no codeskeletons. The subsequentlayers then becomemoreand more algorithm-like as their code skeletons growmoreand moredetailed. Theselayers conrain formuladecompostion schemas(e.g., an index decomposition for independentlyandidentically distributed randomvariables), properstatistical algorithmschemas(e.g., EM),andnumericaloptimizationschemas(e.g., the simplex method),respectively. This layeringis not only conceptual but is also usedas a heuristic to controlthe synthesisprocess. Each schemahas a numberof enabling conditions whichare checkedagainstthe current statistical model;each schemaapplication can modifythe modeland thus trigger the application of other schemas.This mechanism allows AUTOBAYES to generate code as a compositionof different schemas,thus "re-inventing"data-analysis algorithms from simple building-blocks. AUTOBAYES augments the schema-guidedapproach by symbolic-algebraic computation and can thus derive closed-formsolutions for many problems and sub-problems. Moredetails on AUTOBAYES and the incorporated schemascan be found in (Fischer, Schumann,& Pressburger 2000) and (Fischer & Schumann 2001). Wehave applied AUTOBAYES to a large numberof textbook examples, machinelearning benchmarks,and NASA applications. Theseencompass suchdiverse tasks as for exampleclustering of rock samplespectra, changepoint detection in "y-raybursts, andsoftwarereliability estimation.Synthesis times weregenerally in the sub-minuterange (on 360MHz Sunworkstation);smallspecifications are typically solved in a few seconds--thesynthesis time for the randomwalkexamplefrom Figure 1 is 1.8 seconds. The generated programswere as long as 2000lines of commented code. This yields leveragefactors fromspecifcationto code between1:10 and 1:30, dependingon the existence (and AUTOBAYES’s ability to find) closed-formsolutions. For the randomwalk example,AUTOBAYES can find the existing closed-formsolution; the generated120lines C-program thus consist mainlyof boilerplate codeandcomments. Acknowledgments AUTOBAYES is a joint developmenteffort with WrayBuntine and JohannSchumann. References W.L. Buntine1994."Operationsfor learning with graphical models".JAIR, 2:159-225,1994. A. P. Dempster,N. M.Laird, andD. B. Rubin1977."Maximumlikelihood from incompletedata via the EMalgorithm(with discussion)".J. of the RoyalStatistical Society series B, 39:1-38,1977. B. Fischer, J. Schumann, and T. Pressburger2000. "Generating Data Analysis Programsfrom Statistical Models (PositionPaper)".In W.Taha,(ed.), Prec. Intl. Workshop Semantics Applications, and Implementationof Program Generation,LNCS1924, pp. 212-229, Montreal, Canada, September2000. Springer. B. Fischer and J. Schumann2000. AutoBayes:A Systemfor GeneratingDataAnalysisProgramsfromStatisti- 74 cal Models,2001.Submittedfor publication. Availableat h~tp://ase, arc. nasa. gov/people/fischer. D. Pelleg and A. Moore 1999. "Accelerating Exact k-meansAlgorithmswith GeometricReasoning".In S. Chaudhuriand D. Madigan,(eds.), Prec. 5th KDD,pp. 277-281,San Diego, CA,August15-18 1999. ACM Press.