Codes for astrostatistics: StatCodes & VOStat
Eric Feigelson
Penn State

Vast range of statistical problems in modern astronomy
• Poisson processes: point processes, time series analysis
• Image analysis: MLE deconvolution, adaptive smoothing, wavelet analyses
• Multivariate analysis & classification (w/ meas errors)
• Survival analysis (censoring & truncation w/ meas errors)
• Parametric models: model selection, non-linear regression
• Non-parametric methods
• Confidence limits: bootstrap resampling
• Prior knowledge: Bayesian inference (see talk at PhysStat 2003 conference)

The problem
Astronomers are insufficiently trained in modern applied statistics … but even if they knew what to do, they have inadequate access to computer codes.
• Astronomers never use large commercial statistical packages like SAS, SPSS, Statistica
• Some astronomers sometimes use UNIX-based command-line systems like MatLab or S-Plus
• Astronomers like mini-codes in Numerical Recipes & often write their own codes. Many like IDL, which has simple statistics.
• NASA/NSF observatories produce huge data analysis codes (IRAF, AIPS, CIAO, …) which by policy avoid proprietary codes
• A few specialized stand-alone astrostat codes written under NASA funding: ROSTAT, ASURV, SLOPES, StatPy
Altogether this is a very bad situation: vast statistical needs with very inadequate codes

The rise of the Virtual Observatory
Vast collections of calibrated data (images, spectra, time series), extracted catalogs (rows=sources, columns=properties), and source bibliographies emerged during the 1990s.
NASA Science Archive Centers (MAST, HEASARC, IRSA, LAMDA), bibliographic databases (ADS, SIMBAD, NED), & more are being transformed into a federated (though still distributed & heterogeneous) system.
XML metadata (VOTable), SOAP protocols, … for data mining & extraction.
but originally no plan for visualization & statistical analysis of extracted datasets

StatCodes: A partial solution
• In the late 1990s, the Penn State group created a Web metasite with annotated links to ~200 open source packages & codes of utility to astronomers.
• Quite successful: 50-100 hits/day for 7 years.
• Multivariate & time series methods most popular.
But the collection of on-line codes was very inhomogeneous and incomplete

R
Finally a broad public-domain statistical software system emerges
Based on the successful commercial UNIX-based S/S-Plus, R has an interactive command-line feel (like IDL), flexible data I/O, acceptable graphics, integration with C/Fortran/Python/…, and quite a lot of sophisticated statistical methods.
Core R: 2000-page manual with ~200 functionalities, some very complex & advanced
CRAN: 300 add-on packages, dozens useful to astronomers. Some are themselves full systems.

VOStat: A Web service
1. Web form interface providing simple statistical R functions with VOTable inputs
2. Same R functions provided through a more sophisticated Java-based grid-computing mode.
[Figure: VOStat data flow. Labels: User, Requests, Answers, VOStat server (heavy statistical computation), Dispersed VO data bases (heavy data).]

VOStat may be a big improvement but …
• Generic Web-based services are inherently inflexible & limited. VOStat may serve to entice the astronomer to download R & perform the real analysis at home.
• Astronomers need training in advanced methods before using them with R. Penn State has just created a Center for Astrostatistics to develop curriculum, conduct tutorials, provide template R code, etc.
• R/CRAN does not serve huge VO datasets or some special astrostat needs. New methodological/code development underway (CMU, Cornell, PSU, UCIrv, …)
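
The VOStat idea described above (a simple statistical function applied to a VOTable input) can be illustrated with a rough sketch. It is written in Python with only the standard library, not in R, and it is not VOStat's actual code. The toy table, the column name `flux`, and the helper names `read_column` and `bootstrap_ci` are all hypothetical. It reads one column from a VOTable's TABLEDATA block and computes percentile bootstrap confidence limits on the mean, one of the methods listed in the opening slide.

```python
import random
import statistics
import xml.etree.ElementTree as ET

# Hypothetical toy VOTable (made-up fluxes; XML namespace omitted for brevity).
VOTABLE = """<VOTABLE version="1.1">
 <RESOURCE>
  <TABLE name="toy_sources">
   <FIELD name="id" datatype="int"/>
   <FIELD name="flux" datatype="double"/>
   <DATA>
    <TABLEDATA>
     <TR><TD>1</TD><TD>10.2</TD></TR>
     <TR><TD>2</TD><TD>11.5</TD></TR>
     <TR><TD>3</TD><TD>9.8</TD></TR>
     <TR><TD>4</TD><TD>12.1</TD></TR>
     <TR><TD>5</TD><TD>10.9</TD></TR>
     <TR><TD>6</TD><TD>11.0</TD></TR>
     <TR><TD>7</TD><TD>9.5</TD></TR>
     <TR><TD>8</TD><TD>10.6</TD></TR>
    </TABLEDATA>
   </DATA>
  </TABLE>
 </RESOURCE>
</VOTABLE>"""


def read_column(votable_xml, column):
    """Return one numeric column from the TABLEDATA rows of a VOTable string."""
    root = ET.fromstring(votable_xml)
    names = [field.get("name") for field in root.iter("FIELD")]
    idx = names.index(column)  # position of the requested column in each row
    return [float(row.findall("TD")[idx].text) for row in root.iter("TR")]


def bootstrap_ci(values, stat=statistics.mean, n_resamples=2000,
                 level=0.90, seed=0):
    """Percentile bootstrap confidence limits for `stat` of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement, recompute the statistic, and sort the results.
    estimates = sorted(stat(rng.choices(values, k=n))
                       for _ in range(n_resamples))
    lo_idx = round((1.0 - level) / 2.0 * n_resamples)
    hi_idx = n_resamples - 1 - lo_idx
    return estimates[lo_idx], estimates[hi_idx]


flux = read_column(VOTABLE, "flux")
low, high = bootstrap_ci(flux)
print(f"mean flux = {statistics.mean(flux):.2f}, "
      f"90% bootstrap limits = [{low:.2f}, {high:.2f}]")
```

A real VOTable carries an XML namespace and may use binary or FITS serializations, so a production reader would rely on a dedicated VOTable library rather than this hand-rolled parser.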