Synthesis Automatic of Statistical Data Analysis Programs

From: AAAI Technical Report SS-02-05. Compilation copyright © 2002, AAAI (www.aaai.org). All rights reserved.
AutomaticSynthesisof Statistical DataAnalysis Programs
Position Statementm
Bernd Fischer
RIACS/NASA
Ames Research Center
fisch@email, arc. nasa. gov
Statistical data analysis is a core activity in experimental sciences. It encompassesa wide variety of tasks, ranging from, e.g., a simple linear regression to fitting complex dynamicalmodels. Developingstatistical data analysis
programs, however, is an arduous and error-prone process
whichrequires profoundexpertise in different areas: statistics, numerics, software engineering, and of course the scientific application domain.
Webelieve that statistical data analysis is a very promising domainfor the application programsynthesis, despite-or evenbecause--thesedifficulties. Statistics providesa unifying and concise domain-specific notation. Graphical models (Buntine 1994) provide a structuring mechanismwhich
can be exploited during the synthesis process, e.g., to decomposea problem into independent subproblems. Statistical algorithms like EM(Dempster, Laird, &Rubin 1977) are
often applicable to a wide range of problems;their generic
formulations allow a "plug’n’play"-style algorithm combination. Recently developedsophisticated data structures as
for examplekd-trees (PeUeg& Moore1999) offer orders-ofmagnitudespeed-up for certain problems but are rarely employed due to the increased programmingcomplexity they
cause. Finally, data analysis is characterized by an iterative
developmentstyle: an initial modelis hypothesized, implemented,evaluated on real data, and--if necessary---refined.
However,each iteration typically involves substantial programmingefforts as prototyping is often not sufficient to
work with real data sets and even small modelmodifications
mayrequire radically dfferent algorithms. Programsynthesis encapsulatesmanyof the statistical, numerical,and software engineering aspects of each iteration and thus allows
users to concentrate on their scientific application. Its fast
turn-around times make model refinement and design-space
exploration feasible.
We are currently developing AUTOBAYES,
a program
synthesis system for the generation of data analysis programsfrom statistical models. A statistical modelspecifies
the properties for each problemvariable (i.e., observationor
parameter) and its dependenciesin the form of a probability
distribution. Figure 1 shows the model for a random walk
(i.e., a simple noisy dynamical process) in AUTOBAYES’s
input notation. The last twolines are the core of the specification; the remaining lines just declare the modelvariables
Copyright(~) 2002,American
Associationfor Artificial Intelli-
73
model walk as ’RandomWalk’.
const nat n_points.
where 0 < n_points.
double rate as ’drift rate / time slice’.
double error as ~drift error / ~iee slice’.
where 0 < error.
data double x(O..n_points-1).
x(I) " gauss(cond(I>O,x(I-1),O)+rate,
maxpr(x[rate, error) ~or rate, error.
Figure ]: AUTOBAYES
specification
of random walk
and imposesomeadditional constraints on them. The distribution statement
x(I) " gaues(cond(I>O,x(I-1),O)+rate,
characterizes the observed data z: each observation zi is
normal distributed (or Gaussian) with a gradually changing
mean and constant but unknownerror. Its mean value is
expected to depend on the previous observation zi-1 and
an unknown
drift rate. Overtime, the data points thus drift
awayfrom their origin, which is here modeledas zero. The
optimization statement
max pr(xl~rate, error}) for Grate, error}
characterizes the data analysis task: find the values for the
unknownrate and error parameters which explain best the
knowndata z, i.e., maximizethe data probability under the
parameter values and modeldistributions.
From such models, AUTOBAYES
generates optimized
and thoroughly documented C/C++ code which can be
linked dynamically into the Matlab and Octave environments. AUTOBAYES
follows a schema-based synthesis approach rather than the direct proof-as-programs-basedapproach. However,schemas can equivalently be understood
as theorems of the domaintheory or as generic algorithms,
and AUTOBAYES
incorporates schemas with a distinctive
"theoremflavor" as well as such with moreof an "algorithm
flavor." The schemasare organizedinto four different hierarchical layers. The top-most layer contains the most theoremlike schemas,whicharc direct formalizationsof different decomposition theorems for graphical models. These schemas
gence(wwwoA_~oorg).
All rights reserved.
contain almost no codeskeletons. The subsequentlayers
then becomemoreand more algorithm-like as their code
skeletons growmoreand moredetailed. Theselayers conrain formuladecompostion
schemas(e.g., an index decomposition for independentlyandidentically distributed randomvariables), properstatistical algorithmschemas(e.g.,
EM),andnumericaloptimizationschemas(e.g., the simplex
method),respectively. This layeringis not only conceptual
but is also usedas a heuristic to controlthe synthesisprocess. Each schemahas a numberof enabling conditions
whichare checkedagainstthe current statistical model;each
schemaapplication can modifythe modeland thus trigger
the application of other schemas.This mechanism
allows
AUTOBAYES
to generate code as a compositionof different schemas,thus "re-inventing"data-analysis algorithms
from simple building-blocks. AUTOBAYES
augments the
schema-guidedapproach by symbolic-algebraic computation and can thus derive closed-formsolutions for many
problems and sub-problems. Moredetails on AUTOBAYES
and the incorporated schemascan be found in (Fischer,
Schumann,& Pressburger 2000) and (Fischer & Schumann
2001).
Wehave applied AUTOBAYES
to a large numberof textbook examples, machinelearning benchmarks,and NASA
applications. Theseencompass
suchdiverse tasks as for exampleclustering of rock samplespectra, changepoint
detection in "y-raybursts, andsoftwarereliability estimation.Synthesis times weregenerally in the sub-minuterange (on
360MHz
Sunworkstation);smallspecifications are typically
solved in a few seconds--thesynthesis time for the randomwalkexamplefrom Figure 1 is 1.8 seconds. The generated programswere as long as 2000lines of commented
code. This yields leveragefactors fromspecifcationto code
between1:10 and 1:30, dependingon the existence (and
AUTOBAYES’s
ability to find) closed-formsolutions. For
the randomwalk example,AUTOBAYES
can find the existing closed-formsolution; the generated120lines C-program
thus consist mainlyof boilerplate codeandcomments.
Acknowledgments
AUTOBAYES
is a joint developmenteffort with WrayBuntine and JohannSchumann.
References
W.L. Buntine1994."Operationsfor learning with graphical models".JAIR, 2:159-225,1994.
A. P. Dempster,N. M.Laird, andD. B. Rubin1977."Maximumlikelihood from incompletedata via the EMalgorithm(with discussion)".J. of the RoyalStatistical Society
series B, 39:1-38,1977.
B. Fischer, J. Schumann,
and T. Pressburger2000. "Generating Data Analysis Programsfrom Statistical Models
(PositionPaper)".In W.Taha,(ed.), Prec. Intl. Workshop
Semantics Applications, and Implementationof Program
Generation,LNCS1924, pp. 212-229, Montreal, Canada,
September2000. Springer.
B. Fischer and J. Schumann2000. AutoBayes:A Systemfor GeneratingDataAnalysisProgramsfromStatisti-
74
cal Models,2001.Submittedfor publication. Availableat
h~tp://ase, arc. nasa. gov/people/fischer.
D. Pelleg and A. Moore 1999. "Accelerating Exact k-meansAlgorithmswith GeometricReasoning".In
S. Chaudhuriand D. Madigan,(eds.), Prec. 5th KDD,pp.
277-281,San Diego, CA,August15-18 1999. ACM
Press.