Computational AstroStatistics
Bob Nichol (Carnegie Mellon)

Outline
- Motivation and goals
- Multi-resolutional kd-trees (examples)
- N-point functions (application)
- Mixture models (applications)
- Bayes-net anomaly detection (application)
- Very high dimensional data
- NVO problems
- Collaborators

Pittsburgh Computational AstroStatistics (PiCA) Group
- Astro: Chris Miller, Percy Gomez, Kathy Romer, Andy Connolly, Andrew Hopkins, Mariangela Bernardi, Tomo Goto
- Statistics: Larry Wasserman, Chris Genovese, Wong Jang, Pierpaolo Brutti
- CS: Andrew Moore, Jeff Schneider, Brigham Anderson, Alex Gray, Dan Pelleg
- SDSS: Alex Szalay, Gordon Richards, Istvan Szapudi and others
(See http://www.picagroup.org)

First Motivation
- Cosmology is moving from a "discovery" science to a "statistical" science.
- There is a drive for "high precision" measurements: cosmological parameters to a few percent, an accurate description of the complex structure in the universe, and control of observational and sampling biases.
- New statistical tools (e.g. non-parametric analyses) are often computationally intensive, and we often want to re-sample or Monte Carlo the data as well.

Second Motivation
- The last decade was dedicated to building more telescopes and instruments, and more are coming this decade (SDSS, Planck, LSST, 2MASS, DPOSS, MAP), along with larger simulations.
- We face a "data flood": SDSS produces terabytes of data a night, while LSST will be an SDSS every 5 nights! Petabytes by the end of the decade.
- These datasets are highly correlated and of high dimensionality, and existing statistics and algorithms do not scale into these regimes.
- This is a new paradigm in which we must build new tools before we can analyze and visualize the data.

SDSS Data (the factors multiply to the quoted overall factor of 12,000,000)

  Quantity     SDSS              Factor
  Area         10,000 sq deg     3
  Objects      2.5 billion       200
  Spectra      1.5 million       200
  Depth        R = 23            10
  Attributes   144 presently     10

SDSS Science: the most distant object; 100,000 spectra.

Goal: build new, fast and efficient statistical algorithms
- Start with tree data structures: multi-resolutional kd-trees.
- They scale to n dimensions (although for very high dimensions we use new tree structures).
- Use a cached representation: store summary sufficient statistics at each node and compute counts from these statistics.
- Prune the tree, which is stored in memory.
- See Moore et al. 2001 (astro-ph/0012333). Many applications; a whole suite of algorithms. (A minimal pair-counting sketch built on these ideas follows this section.)

Range Searches
- Fast range searches and catalog matching.
- Prune cells entirely outside the range; also "prune" cells entirely inside it, counting their contents in bulk from the cached statistics, for an even greater saving in time.

N-point correlation functions
- The 2-point function has a long history in cosmology (Peebles 1980) and, as a point-process statistic, in Statistics. It is the excess joint probability of finding a pair of points over that expected from a Poisson process:
      dP = n^2 [1 + ξ(r)] dV1 dV2 .
- Similarly, the 3-point function ζ is defined by
      dP = n^3 [1 + ξ(r12) + ξ(r23) + ξ(r31) + ζ(r12, r23, r31)] dV1 dV2 dV3 ,
  and so on for higher orders.
- Two point distributions can have the same 2-point function but very different 3-point functions.
- Naively this is an O(N^n) computation for the n-point function, but all it really is, is a set of range searches.

Dual Tree Approach
- Pairs are usually binned into annuli rmin < r < rmax. For each annulus, traverse both trees simultaneously and prune pairs of nodes whose separations lie entirely outside it, i.e. dmax < rmin or dmin > rmax.
- If instead dmin > rmin and dmax < rmax, every pair between the two nodes lies within the annulus, so the product of their cached counts can be added without descending further.
- Therefore we only need to explicitly count pairs in nodes that cut the annulus boundaries.
- Extra speed-ups are possible by processing multiple r bins together and by using controlled approximations.
- Naive pair counting is O(N^2), and O(N^3) for the 3-point function; the tree-based approach behaves roughly as O(N log N), with the actual time depending on the density of points, the bin size and the scale.

Fast Mixture Models
- Describe the data in N dimensions as a mixture of, say, Gaussians (the kernel shape matters less than the bandwidth!).
- The parameters of the model are then the Gaussians themselves, each with a mean and covariance.
- Iterate, testing with BIC and AIC at each iteration. (A short model-selection sketch follows the pair-counting example below.)
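To make the cached-count and dual-tree pruning ideas concrete, here is a minimal Python sketch. It is not the PiCA / Moore et al. code: the class names, the leaf size and the use of a single annulus are illustrative choices, and a production version would cache richer sufficient statistics and use the multiple-bin and approximation tricks mentioned above.

```python
# Minimal sketch of a kd-tree with cached counts and a dual-tree pair count
# within one annulus rmin <= r < rmax.  Illustrative only, not the PiCA code.
import numpy as np

class Node:
    def __init__(self, points):
        self.n = len(points)                  # cached sufficient statistic: count
        self.lo = points.min(axis=0)          # bounding box of the node
        self.hi = points.max(axis=0)
        self.left = self.right = None
        if self.n > 32:                       # leaf-size threshold (arbitrary)
            d = np.argmax(self.hi - self.lo)  # split along the widest dimension
            order = np.argsort(points[:, d])
            half = self.n // 2
            self.left = Node(points[order[:half]])
            self.right = Node(points[order[half:]])
        else:
            self.points = points

def box_dist(a, b):
    """Min and max possible distance between a point in box a and one in box b."""
    gap = np.maximum(0.0, np.maximum(a.lo - b.hi, b.lo - a.hi))
    dmin = np.sqrt((gap ** 2).sum())
    span = np.maximum(a.hi - b.lo, b.hi - a.lo)
    dmax = np.sqrt((span ** 2).sum())
    return dmin, dmax

def pair_count(a, b, rmin, rmax):
    """Ordered pairs (one point from a, one from b) with rmin <= r < rmax."""
    dmin, dmax = box_dist(a, b)
    if dmax < rmin or dmin >= rmax:           # entirely outside the annulus: prune
        return 0
    if dmin >= rmin and dmax < rmax:          # entirely inside: count in bulk
        return a.n * b.n
    if a.left is None and b.left is None:     # both leaves: brute-force the rest
        d = np.linalg.norm(a.points[:, None, :] - b.points[None, :, :], axis=-1)
        return int(((d >= rmin) & (d < rmax)).sum())
    if a.left is None or (b.left is not None and b.n > a.n):
        a, b = b, a                           # always recurse into an internal node
    return pair_count(a.left, b, rmin, rmax) + pair_count(a.right, b, rmin, rmax)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.random((10000, 2))
    root = Node(data)
    # DD counts for one annulus; with root used for both arguments each pair is
    # counted twice and self-pairs fall outside the annulus because rmin > 0.
    print(pair_count(root, root, 0.05, 0.10))
```

Counting data-data, data-random and random-random pairs this way feeds a standard estimator such as Landy-Szalay for ξ(r), and the same traversal idea extends to triples of nodes for the 3-point function.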
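As a stand-in illustrating just the BIC/AIC model-selection loop described above, here is a small sketch using scikit-learn's ordinary EM-based GaussianMixture on invented toy data, rather than the kd-tree-accelerated code discussed below.

```python
# Sketch of mixture-model fitting with BIC/AIC model selection, using
# scikit-learn's standard EM implementation on toy 2-D "color-color" data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal([0.0, 0.0], 0.10, size=(2000, 2)),   # three invented blobs
    rng.normal([1.0, 0.5], 0.20, size=(1000, 2)),
    rng.normal([0.3, 1.2], 0.05, size=(500, 2)),
])

best = None
for k in range(1, 11):                               # try 1..10 Gaussians
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          n_init=3, random_state=0).fit(data)
    bic, aic = gmm.bic(data), gmm.aic(data)
    print(f"k={k:2d}  BIC={bic:10.1f}  AIC={aic:10.1f}")
    if best is None or bic < best[0]:
        best = (bic, k, gmm)

bic, k, gmm = best
print(f"BIC prefers k={k} Gaussians")
# Low-likelihood points under the chosen model are anomaly candidates,
# as in the quasar-selection and anomaly-detection applications below.
logp = gmm.score_samples(data)
print("10 lowest log-likelihood points:", np.argsort(logp)[:10])
```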
The kd-tree-based implementation is fast (about 20 minutes for 100,000 points on a PC) and also employs a heuristic splitting algorithm; details are in Connolly et al. 2000 (astro-ph/0008187).

[Figure sequence: EM-based Gaussian mixture clustering shown at stages 1, 2, 4 and 20.]

Applications
- SDSS quasar selection, where the mixture model is used to map the multi-color stellar locus (Gordon Richards @ PSU).
- Anomaly detection: look for low-probability points in N dimensions.
- Optimal smoothing of large-scale structure.

SDSS QSO target selection in 4-D color space
- Cluster 9,999 spectroscopically confirmed stars.
- Cluster 8,833 spectroscopically confirmed QSOs (33 Gaussians).
- Classification success: 99% for stars, 96% for QSOs.

Bayes Net Anomaly Detector
- Instead of fitting a single joint probability function to the data, factorize it into a smaller set of conditional probabilities.
- The graph is directed and acyclic; if we know the graph and the conditional probabilities, we have a valid probability model for the whole dataset.
- Use 1.5 million SDSS sources (25 variables each) to learn the model, then evaluate the likelihood of each object under it. The 1,000 lowest-likelihood objects are flagged as anomalous; look at them and follow them up at Keck.
- Unfortunately, many of the flagged objects are simply errors.
- An advantage of the Bayes net is that it tells you why an object was anomalous: which conditional probabilities were most unusual. Therefore, iterate the loop: have a scientist highlight the obvious errors, then suppress those failure modes so they do not return. This is an issue of productivity. (A toy scoring sketch is given at the end of this section.)

Will Only Get Worse
- LSST will do an SDSS every 5 nights looking for transient objects, producing petabytes of data (2007).
- VISTA will collect 300 terabytes of data (2005).
- Archival science is upon us: the HST database has 20 GB per day downloaded, 10 times more than goes in!

Will Only Get Worse II
- Surveys now span the electromagnetic spectrum, and combining them is hard: different sensitivities, resolutions and physics.
- A mixture of imaging, catalogs and spectra; the difference between continuum and point processes.
- Thousands of attributes per source.

What is the VO?
The "Virtual Observatory" must:
- federate multi-wavelength data sources (interoperability);
- empower everyone (democratise);
- be fast, distributed and easy;
- allow both input and output.

Computer Science + Statistics!
- Scientists will need help through autonomous scientific discovery in large, multi-dimensional, correlated datasets.
- Scientists will need fast databases, distributed computing, fast networks and new visualization tools.
- CS and Statistics are looking for new challenges, and here there are no data-rights or privacy issues.
- A new breed of students with IT skills is needed.
- A symbiotic relationship.

VO Prototype
- Ideally we would like all parts of the VO to be web services.
- [Architecture diagram: database <-> C#/.NET web service (http) <-> EM code.]

Lessons We Learnt
- It is tough to marry research C code developed under Linux to the Microsoft world (pointers to memory); .NET has "unsafe" memory, and the .NET server is hard to set up.
- Migrate to using VOTables to perform all I/O.
- Keep the server running at CMU so we can control the code.

Very High Dimensions
- Using LLE and Isomap to look for lower-dimensional manifolds in higher-dimensional spaces, e.g. a 500 x 2000 space built from SDSS spectra.

Summary
- Era of New Cosmology: massive data sources and the search for subtle features and high-precision measurements.
- We need new methods that scale into these new regimes, "a virtual universe" (and students will need different skills).
- Perfect synergy between Statistics, CS and Physics.
- Good algorithms are worth as much as faster and more numerous computers!
- The "glue" needed to make a "virtual observatory" is hard and complex. Don't under-estimate the job.
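Before turning to FDR, here is a toy sketch of the Bayes-net anomaly scoring described in the "Bayes Net Anomaly Detector" slides above. The graph (mag -> color -> morph), the attribute names, the binning and the smoothing are all invented for illustration; the real system is learned from 1.5 million SDSS sources with 25 attributes each.

```python
# Toy sketch of Bayes-net anomaly scoring: a fixed DAG over a few discretized
# attributes, conditional probability tables estimated by counting, and a
# per-object log-likelihood whose lowest values are flagged as anomalies.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 50000
# Fake "catalog": magnitude drives color, color drives a morphology class.
mag = rng.normal(20, 1.5, n)
color = 0.5 * (mag - 20) + rng.normal(0, 0.3, n)
morph = (color + rng.normal(0, 0.5, n) > 0).astype(int)
cat = pd.DataFrame({
    "mag": pd.cut(mag, 8, labels=False),        # discretize each attribute
    "color": pd.cut(color, 8, labels=False),
    "morph": morph,
})

# Fixed DAG: mag -> color -> morph (directed, acyclic), so the joint factorizes as
# P(mag, color, morph) = P(mag) P(color | mag) P(morph | color).
structure = {"mag": [], "color": ["mag"], "morph": ["color"]}

def log_cpt(df, child, parents, alpha=1.0):
    """Per-row log P(child | parents) from smoothed empirical counts."""
    joint = df.groupby(parents + [child])[child].transform("size") + alpha
    if parents:
        marg = df.groupby(parents)[child].transform("size") + alpha * df[child].nunique()
    else:
        marg = float(len(df)) + alpha * df[child].nunique()
    return np.log(joint / marg).to_numpy()

# Per-object log-likelihood under the factorized model, factor by factor.
factors = {c: log_cpt(cat, c, p) for c, p in structure.items()}
loglike = sum(factors.values())

worst = np.argsort(loglike)[:10]                 # "the lowest 1000" in the real case
for i in worst:
    blame = min(factors, key=lambda c: factors[c][i])
    print(f"object {i}: logL={loglike[i]:.2f}, most unusual factor: {blame}")
```

Reporting the lowest-probability factor for each flagged object is what lets a scientist quickly recognise an "anomaly" as a data error and suppress that failure mode on the next iteration of the loop.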
Are the Features Real? (FDR)
- This is an example of multiple hypothesis testing, e.g. is every point consistent with a smooth P(k)?
- Let us first look at a simulated example: a 1000 x 1000 pixel image containing 40,000 sources.

  Method       True detections   False detections   Missed sources   Correct non-detections
  FDR                30389              1505              9611              958495
  2-sigma            31497             22728              8503              937272
  Bonferroni         27137                 0             12863              960000

- FDR makes about 15 times fewer false discoveries than a traditional 2-sigma threshold for essentially the same power. Why? Because it controls a scientifically meaningful quantity,
      FDR = (number of false discoveries) / (total number of discoveries),
  and it is adaptive to the size of the dataset.
- Applied to the power spectrum, we used an FDR threshold of 0.25, i.e. at most 25% of the circled points are expected to be in error. Therefore we can say with statistical rigor that most of these points are genuine rejections of the smooth model and are thus "features", even though no single point is a 3-sigma deviation.
- New statistics has enabled an astronomical discovery.
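To make the FDR step itself concrete, here is a small sketch of the Benjamini-Hochberg procedure on a toy version of the simulated detection problem above (960,000 noise pixels plus 40,000 sources drawn from invented distributions, so the counts will not match the table exactly), compared against 2-sigma and Bonferroni thresholds.

```python
# Benjamini-Hochberg FDR thresholding on a toy source-detection problem.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n_null, n_signal = 960_000, 40_000
flux = np.concatenate([rng.normal(0, 1, n_null),       # noise-only pixels
                       rng.normal(3, 1, n_signal)])    # pixels with sources
truth = np.concatenate([np.zeros(n_null, bool), np.ones(n_signal, bool)])

# One-sided p-value per pixel under the null (pure-noise) hypothesis.
p = norm.sf(flux)

def benjamini_hochberg(p, q=0.25):
    """Boolean 'discovery' mask controlling the false discovery rate at level q."""
    m = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m               # BH comparison line
    passed = np.nonzero(p[order] <= thresh)[0]
    if len(passed) == 0:
        return np.zeros(m, bool)
    cut = p[order][passed[-1]]                          # largest p-value under the line
    return p <= cut

for name, keep in [("FDR q=0.25", benjamini_hochberg(p, 0.25)),
                   ("2-sigma", p <= norm.sf(2.0)),
                   ("Bonferroni", p <= 0.05 / len(p))]:
    true_det = int((keep & truth).sum())
    false_det = int((keep & ~truth).sum())
    print(f"{name:11s} true detections={true_det:6d}  false detections={false_det:6d}")
```

The BH threshold adapts to the observed distribution of p-values, which is why FDR keeps roughly the power of a fixed 2-sigma cut while making far fewer false discoveries, and why the guarantee (on average, at most a fraction q of the discoveries are false) is scientifically meaningful.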