THE US NATIONAL VIRTUAL OBSERVATORY

Scientific Data Mining in Astronomy
Kirk D. Borne
George Mason University
kborne@gmu.edu
http://classweb.gmu.edu/kborne/
2008 NVO Summer School

OUTLINE
• Scientific Databases
• Some key astronomy problems
• Astronomy Data Mining examples
• Suggested Reading
• Some Data Mining Software
• Summary

10 Unique Features of Scientific Data
• Each of these characteristics requires special handling beyond what you read in standard data mining textbooks:
1. Scientific data depend on experimental equipment and conditions.
2. Scientific data have noise.
3. Scientific data have been (or need to be) calibrated.
4. Scientific units on data values are imperative.
5. Scientific databases often contain associated columns: { value, error }.
6. Scientific data values are often non-linear (log values, magnitudes, asinh).
7. History of scientific data creation, processing, and versioning is critical = Provenance.
8. Metadata, Metadata, Metadata = tells us “who, what, when, where, how”. NOTE: Semantic Metadata are becoming more important = “why”.
9. Context is critical (e.g., brightness in an optical catalog is expressed in mags, but expressed in counts/sec in an X-ray catalog, or milli-Jansky in a radio catalog).
10. Scientific data have different levels of abstraction: raw, calibrated, reduced data products, derived information, extracted knowledge, published results.
• All of this makes the “Data Preparation” phase of any scientific data mining experiment even more critical and essential.
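Items 5, 6, and 9 above (paired { value, error } columns, non-linear values, and catalog-dependent units) can be illustrated with a small data-preparation sketch. It is only an illustration, not a method from the slides: it assumes a radio-style flux density in milli-Jansky and converts it to an AB magnitude, propagating the error to first order. The function name and the sample values are hypothetical.

```python
import math

# Flux density of a zero-magnitude source in the AB system, in Jansky.
AB_ZERO_POINT_JY = 3631.0


def flux_mjy_to_ab_mag(flux_mjy, flux_err_mjy):
    """Convert a flux density in mJy (with 1-sigma error) to an AB magnitude and its error."""
    if flux_mjy <= 0:
        # A logarithmic magnitude is undefined for non-positive flux;
        # asinh magnitudes exist to handle that case, but are not sketched here.
        raise ValueError("flux must be positive for a logarithmic magnitude")
    flux_jy = flux_mjy * 1e-3
    mag = -2.5 * math.log10(flux_jy / AB_ZERO_POINT_JY)
    # First-order error propagation: d(mag) = (2.5 / ln 10) * d(flux) / flux
    mag_err = (2.5 / math.log(10)) * (flux_err_mjy / flux_mjy)
    return mag, mag_err


# Hypothetical catalog row: { value, error } = { 3.631 mJy, 0.3631 mJy }
mag, mag_err = flux_mjy_to_ab_mag(3.631, 0.3631)
print(f"AB magnitude = {mag:.3f} +/- {mag_err:.3f}")
```

Carrying the error through the unit conversion keeps the { value, error } pair consistent, so attributes from different catalogs can be compared on the same scale before any mining step.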
Some key astronomy problems
• Some key astronomy problems that can be addressed with data mining techniques:
– Cross-Match objects from different catalogues
– The distance problem (e.g., Photometric Redshift estimators)
– Star-Galaxy Separation
– Cosmic-Ray Detection in images
– Supernova Detection and Classification
– Morphological Classification (galaxies, AGN, gravitational lenses, ...)
– Class and Subclass Discovery (brown dwarfs, methane dwarfs, ...)
– Dimension Reduction = Correlation Discovery
– Learning Rules for improved classifiers
– Classification of massive data streams
– Real-time Classification of Astronomical Events
– Clustering of massive data collections
– Novelty, Anomaly, Outlier Detection in massive databases

Classification Methods: Decision Trees, Neural Networks, SVM (Support Vector Machines)
There are 2 Classes! How do you ...
- Separate them?
- Distinguish them?
- Learn the rules?
- Classify them?
Apply Kernel (SVM)

Decision Tree Classification Example: SKICAT Star-Galaxy Discrimination
Reference: ftp://iraf.noao.edu/iraf/conf/web/adass_proc/adass_95/yooj/yooj.html

Decision Tree Classification Example: Classification of candidates for new supernovae in galaxies
Reference: http://spiff.rit.edu/richmond/sdss/sn_survey/scan_manual/sn_scan.html

Clustering is used to discover the different unique groupings (classes) of attribute values.
The case shown below is not obvious: one or two groups?

This case is easier: there are two groups.
(In fact, this is the same set of data elements as shown on the previous slide, but plotted here using a different attribute.)

Clustering in multiple dimensions: colors combined from SDSS & 2MASS magnitudes

Clustering: Class Discovery and Rule Learning
• Reference: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf
• Clusters and the separation of classes depend on which attributes (dimensions) are chosen to be projected, as in the following star-galaxy discrimination test:
[Figure: two projections of the same data, one labeled “Not good” and one labeled “Good”]

Semisupervised Learning: Outlier Detection
• Reference: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf
A demonstration of a generic machine-assisted discovery problem: data mapping and a search for outliers. This schematic illustration is of the clustering problem in a parameter space given by three object attributes: P1, P2, and P3. In this example, most of the data points are assumed to be contained in three dominant clusters (DC1, DC2, and DC3). However, one may want to discover less populated clusters (e.g., small groups or even isolated points), some of which may be too sparsely populated, or lie too close to one of the major data clouds. In some cases, negative clusters (holes) may exist in one of the major data clusters.

Outlier Detection: Serendipitous Discovery of Rare or New Objects & Events

Principal Components Analysis & Independent Components Analysis
Cepheid Variables: Cosmic Yardsticks -- One Correlation -- Two Classes! ... Class Discovery!

Example: SOM (Self-Organizing Map)
• The SOM (Self-Organizing Map) is one technique for organizing information in a database based upon links between concepts.
• It can be used to find hidden relationships and patterns in more complex data collections, usually based on links between keywords or metadata.
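To make the SOM idea above concrete, here is a minimal sketch of a one-dimensional self-organizing map in plain Python. It is an assumption-laden toy, not the tool shown on the slide: the slide describes maps built from keyword or metadata links, whereas this sketch uses numeric two-attribute vectors, and the function names, grid size, learning-rate schedule, and synthetic data are all hypothetical choices.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the map node whose weight vector is closest to the sample x."""
    return min(range(len(weights)),
               key=lambda j: sum((w - v) ** 2 for w, v in zip(weights[j], x)))


def train_som(data, n_nodes=4, n_epochs=50, lr0=0.5, radius0=2.0, seed=0):
    """Train a 1-D chain of nodes; neighbouring nodes end up with similar weights."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialise each node from a randomly chosen data point.
    weights = [list(rng.choice(data)) for _ in range(n_nodes)]
    for epoch in range(n_epochs):
        frac = epoch / n_epochs
        lr = lr0 * (1.0 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1.0 - frac), 0.5)  # neighbourhood shrinks, floor at 0.5
        for x in rng.sample(data, len(data)):      # visit samples in random order
            bmu = best_matching_unit(weights, x)
            for j in range(n_nodes):
                # Gaussian neighbourhood on the chain: nodes near the BMU move more.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * radius ** 2))
                for k in range(dim):
                    weights[j][k] += lr * h * (x[k] - weights[j][k])
    return weights


# Synthetic data: two well-separated groups in a hypothetical 2-attribute space.
_rng = random.Random(1)
cluster_a = [(_rng.gauss(0.0, 0.5), _rng.gauss(0.0, 0.5)) for _ in range(20)]
cluster_b = [(_rng.gauss(10.0, 0.5), _rng.gauss(10.0, 0.5)) for _ in range(20)]
data = cluster_a + cluster_b

weights = train_som(data)
print("node for group A centre:", best_matching_unit(weights, (0.0, 0.0)))
print("node for group B centre:", best_matching_unit(weights, (10.0, 10.0)))
```

After training, samples from the two groups should map to different nodes of the chain, which is the sense in which the map "organizes" the collection. Real applications typically use a two-dimensional grid and far more attributes.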
Astronomy Data Mining in Action
Exploring the Time Domain
Mega-Flares on normal Sun-like stars = a star like our Sun increased in brightness 300X one night! … say what??

Example: The Thinking Telescope
Sample Data Mining Applications (credit: http://www.thinkingtelescopes.lanl.gov/ ):
• Automated Feature Extraction: Real-time identification of artifacts and transients in direct and difference images.
• Classifiers: Automated classification of celestial objects based on temporal and spectral properties.
• Anomaly Detection: Real-time recognition of important deviations from normal behavior for persistent sources.

From Sensors to Sense
From Data to Knowledge: from sensors to sense
[Diagram: Data, Information, Knowledge]

VOEventNet: a Rapid-Response Telescope Grid
[Diagram labels, in extraction order: Palomar-Quest, GRB satellites, PQ next-day pipelines, baseline sky, Raptor, catalog, Palomar 60”, PQ Event Factory, Event Synthesis Engine, VOEvent database, VOEventNet, Pairitel, known variables, known asteroids, 2MASS, SDSS, remote archives, eStar]
Reference: http://voeventnet.caltech.edu/

Learning From Archived Temporal Data (Time Series): Classify New Data (Bayes Analysis or Markov Modeling)

Photometric-Redshift Estimation
Photometric vs. Spectroscopic Redshift Estimates:
• Left panel: standard technique
• Right panel: Machine Learning (data mining) application
• Reference: http://arxiv.org/abs/0710.4482

Star-Galaxy Separation in Clustered Feature Space
* = star, • = galaxy
http://arxiv.org/abs/astro-ph/9508012

Bayesian Probabilistic Estimation for Catalog Cross-Matching
• Reference: http://arxiv.org/abs/astro-ph/0605216

Fundamental Plane for 156,000 cross-matched Sloan+2MASS Elliptical Galaxies: plot shows variance captured by first 2 Principal Components as a function of local galaxy density.
Reference: Borne, Dutta, Giannella, Kargupta, & Griffin 2008
[Figure: % of variance captured by PC1+PC2 (vertical axis) versus Local Galaxy Density, low to high (horizontal axis)]

Suggested Reading: Data Mining in Astronomy
• Djorgovski et al. 2000, Searches for Rare and New Types of Objects. http://arxiv.org/abs/astro-ph/0012453
• Djorgovski et al. 2000, Exploration of Large Digital Sky Surveys. http://arxiv.org/abs/astro-ph/0012489
• Djorgovski et al. 2001, Exploration of Parameter Spaces in a Virtual Observatory. http://arxiv.org/abs/astro-ph/0108346
• Mining the Sky, 2001, published proceedings of ESO conference.
• Suchkov et al. 2003, Automated Object Classification with ClassX. astro-ph/0210407
• Suchkov, Hanisch, & Margon 2005, A Census of Object Types and Redshift Estimates in the SDSS Photometric Catalog from a Trained Decision Tree Classifier. http://adsabs.harvard.edu/abs/2005AJ....130.2439S
• Giannella et al. 2006, Distributed Data Mining for Astronomy Catalogs. http://www.cs.umbc.edu/~hillol/PUBS/Papers/Astro.pdf
• Rohde et al. 2006, Matching of Catalogues by Probabilistic Pattern Classification. http://adsabs.harvard.edu/abs/2006MNRAS.369....2R
• Budavari & Szalay 2008, Probabilistic Cross-Identification of Astronomical Sources. http://adsabs.harvard.edu/abs/2008ApJ...679..301B

Suggested Reading, continued: Data Mining in Astronomy
• Odewahn et al. 1993, Star-Galaxy Separation with a Neural Network. 2: Multiple Schmidt Plate Fields. http://adsabs.harvard.edu/abs/1993PASP..105.1354O
• Borne 2000, Science User Scenarios for a Virtual Observatory Design Reference Mission: Science Requirements for Data Mining. astro-ph/0008307
• Brunner et al. 2001, Massive Datasets in Astronomy. astro-ph/0106481
• Gray et al.
2002, Data Mining the SDSS SkyServer Database. http://arxiv.org/abs/cs/0202014
• Odewahn et al. 2004, The Digitized Second Palomar Observatory Sky Survey (DPOSS). III. Star-Galaxy Separation. http://adsabs.harvard.edu/abs/2004AJ....128.3092O
• Ball, Brunner, et al. 2006, Robust Machine Learning Applied to Astronomical Data Sets. I. Star-Galaxy Classification of the Sloan Digital Sky Survey DR3 Using Decision Trees. http://adsabs.harvard.edu/abs/2006ApJ...650..497B
• Ball, Brunner, et al. 2007, Robust Machine Learning Applied to Astronomical Data Sets. II. Quantifying Photometric Redshifts for Quasars Using Instance-based Learning. http://adsabs.harvard.edu/abs/2007ApJ...663..774B
• Ball, Brunner, et al. 2008, Robust Machine Learning Applied to Astronomical Data Sets. III. Probabilistic Photometric Redshifts for Galaxies and Quasars in the SDSS and GALEX. http://adsabs.harvard.edu/abs/2008ApJ...683...12B

Suggested Reading, continued: Data Mining in Astronomy
• Rogers & Riess 1994, Detection and Classification of CCD Defects with an Artificial Neural Network. http://adsabs.harvard.edu/abs/1994PASP..106..532R
• Feeney et al. 2005, Automated Detection of Classical Novae with Neural Networks. http://adsabs.harvard.edu/abs/2005AJ....130...84F
• Wadadekar 2005, Estimating Photometric Redshifts Using Support Vector Machines. http://adsabs.harvard.edu/abs/2005PASP..117...79W
• Bazell & Miller 2005, Class Discovery in Galaxy Classification. http://adsabs.harvard.edu/abs/2005ApJ...618..723B
• Bazell, Miller, & SubbaRao 2006, Objective Subclass Determination of Sloan Digital Sky Survey Spectroscopically Unclassified Objects. http://adsabs.harvard.edu/abs/2006ApJ...649..678B
• Ferreras et al. 2006, A Principal Component Analysis approach to the Star Formation History of Elliptical Galaxies in Compact Groups.
http://adsabs.harvard.edu/abs/2006MNRAS.370..828F
• Way & Srivastava 2006, Novel Methods for Predicting Photometric Redshifts from Broadband Photometry Using Virtual Sensors. http://adsabs.harvard.edu/abs/2006ApJ...647..102W
• Carliles et al. 2007, Photometric Redshift Estimation on SDSS Data Using Random Forests. http://arxiv.org/abs/0711.2477

Some Data Mining Software & Projects
• General data mining software packages:
– Weka (Java): http://www.cs.waikato.ac.nz/ml/weka/
– Weka4WS (Grid-enabled): http://grid.deis.unical.it/weka4ws/
– RapidMiner: http://www.rapidminer.com/
• Astronomy-specific software and/or user clients:
– VO-Neural: http://voneural.na.infn.it/
– AstroWeka: http://astroweka.sourceforge.net/
– OpenSkyQuery: http://www.openskyquery.net/
– ALADIN: http://aladin.u-strasbg.fr/
– MIRAGE: http://cm.bell-labs.com/who/tkh/mirage/
– AstroBox: http://services.china-vo.org/
• Astronomical and/or Scientific Data Mining Projects:
– GRIST: http://grist.caltech.edu/
– ClassX: http://heasarc.gsfc.nasa.gov/classx/
– LCDM: http://dposs.ncsa.uiuc.edu/
– F-MASS: http://www.itsc.uah.edu/f-mass/
– NCDM: http://www.ncdm.uic.edu/

Weka: http://www.cs.waikato.ac.nz/ml/weka/
• Weka is in your NVOSS software distribution.
• Weka is a collection of open source machine learning algorithms for data mining tasks.
• Weka algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka comes with its own GUI.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.
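Applying Weka directly to a dataset usually means first getting the catalog into a format it reads, such as its native ARFF text format. The sketch below is one possible way to write a small table as ARFF from Python; the helper name `to_arff`, the attribute names, and the sample rows are hypothetical, and the slides do not prescribe this step.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, kind) pairs, where kind is the string
    'numeric' or a list of nominal labels such as ['star', 'galaxy'].
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append("@attribute " + name + " " + kind)
        else:
            # Nominal attributes list their allowed labels in braces.
            lines.append("@attribute " + name + " {" + ",".join(kind) + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical two-object catalog with two numeric features and a class label.
attributes = [
    ("g_minus_r", "numeric"),
    ("concentration", "numeric"),
    ("class", ["star", "galaxy"]),
]
rows = [(0.45, 2.1, "star"), (0.92, 3.0, "galaxy")]

arff_text = to_arff("star_galaxy_demo", attributes, rows)
print(arff_text)
```

The resulting text can be saved with an `.arff` extension and opened in the Weka GUI. Placing the class label last matches the common Weka convention of treating the final attribute as the class by default.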
AstroWeka: http://astroweka.sourceforge.net/
http://www.iterating.com/products/Weka
http://weka.sourceforge.net/wekadoc/index.php/en:Knowledge_Flow_%283.4.10%29

ALADIN: http://aladin.u-strasbg.fr/

MIRAGE: http://cm.bell-labs.com/who/tkh/mirage/
Java Package for exploratory data analysis (EDA), correlation mining, and interactive pattern discovery.

Science is Knowledge Work
[Diagram: Data, Information, Knowledge]
• Knowledge Discovery is the central theme of science.
• Knowledge Discovery in Databases (KDD) is the killer app for large scientific databases.
• Therefore, KDD (i.e., Data Mining) is an essential tool, since “big-data” science is here to stay (at petabytes and beyond).

Scientific Knowledge Discovery

Heliophysics Space Weather Example

Sun-Earth Space Environment – Rich Source of Heliophysical Phenomena

Multi-point Observations and Models of Space Plasmas Deliver a Deluge of Physical Measurements

Heliophysics Space Weather Example
CME = Coronal Mass Ejection
SEP = Solar Energetic Particle

Data Mining: It is more than just connecting the dots
Reference: http://homepage.interaccess.com/~purcellm/lcas/Cartoons/cartoons.htm

Sample Astronomy Data Mining Application Ideas for your Projects
– Neural Network for Pixel Classification: Event Detection and Prediction (e.g., Supernova or Cosmic-ray hit?)
– Bayesian Network for Object Classification (star or galaxy?)
– PCA for finding Fundamental Planes of Galaxy Parameters
– PCA (weakest component) for Outlier Detection: anomalies, novel discoveries, new objects
– Link Analysis (Association Mining) for Causal Event Detection (e.g., linking optical transients with gamma-ray events)
– Clustering analysis: Spatial, Temporal, or any scientific database parameters
– Markov models: Temporal mining, classification, and prediction from time series data
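As a starting point for the "PCA (weakest component)" idea in the list above, the following sketch scores each object by how far it sits along the lowest-variance principal component. It is a toy under stated assumptions: it handles only two attributes, so the eigen-decomposition of the 2x2 covariance matrix can be written in closed form without a linear-algebra library, and the function name, the 3-sigma threshold, and the synthetic data are hypothetical.

```python
import math


def weakest_component_scores(points):
    """Distance of each 2-D point along the weakest principal component, in sigma units."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Smallest eigenvalue of a symmetric 2x2 matrix, in closed form.
    lam_min = 0.5 * (a + c) - math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    if b != 0.0:
        vx, vy = b, lam_min - a          # eigenvector belonging to lam_min
    else:
        vx, vy = (1.0, 0.0) if a <= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    sigma = math.sqrt(max(lam_min, 1e-12))  # guard against a zero-variance direction
    return [abs((p[0] - mx) * vx + (p[1] - my) * vy) / sigma for p in points]


# Synthetic data: 20 objects following a tight linear relation, plus one that does not.
points = [(float(i), 2.0 * i + 1.0 + 0.05 * (-1) ** i) for i in range(20)]
points.append((10.0, 30.0))  # the relation predicts about 21 here

scores = weakest_component_scores(points)
flagged = [i for i, s in enumerate(scores) if s > 3.0]
print("outlier candidates:", flagged)
```

Objects that obey the dominant correlation have almost no extent along the weakest component, so a large score there marks an object that breaks the correlation. For catalogs with many attributes the same idea applies, but a numerical eigen-solver would be needed in place of the closed-form 2x2 case.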