Astronomy Data Mining in Action

THE US NATIONAL VIRTUAL OBSERVATORY
Scientific Data Mining
in Astronomy
Kirk D. Borne
George Mason University
kborne@gmu.edu
http://classweb.gmu.edu/kborne/
2008 NVO Summer School
OUTLINE
• Scientific Databases
• Some key astronomy problems
• Astronomy Data Mining examples
• Suggested Reading
• Some Data Mining Software
• Summary
10 Unique Features of Scientific Data
• Each of these characteristics requires special handling beyond what you read in standard data mining textbooks:
1. Scientific data depend on experimental equipment and conditions.
2. Scientific data have noise.
3. Scientific data have been (or need to be) calibrated.
4. Scientific units on data values are imperative.
5. Scientific databases often contain associated columns: { value, error }.
6. Scientific data values are often non-linear (log values, magnitudes, asinh).
7. History of scientific data creation, processing, and versioning is critical = Provenance.
8. Metadata, Metadata, Metadata = tells us “who, what, when, where, how”. NOTE: Semantic Metadata are becoming more important = “why”.
9. Context is critical (e.g., brightness in an optical catalog is expressed in mags, but expressed in counts/sec in an X-ray catalog, or milli-Jansky in a radio catalog).
10. Scientific data have different levels of abstraction: raw, calibrated, reduced data products, derived information, extracted knowledge, published results.
• All of this makes the “Data Preparation” phase of any scientific data mining experiment even more critical and essential.
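As a small illustration of items 5 and 6, the sketch below turns a { value, error } flux pair into a classical logarithmic magnitude with a first-order propagated error, and also into an asinh magnitude. The zero-point flux `f0` and the softening parameter `b` are placeholder assumptions, not values from any particular survey.

```python
import math

POGSON = 2.5 / math.log(10.0)  # ~1.0857, converts natural log to magnitudes

def pogson_mag(flux, flux_err, f0=1.0):
    """Classical magnitude and first-order propagated error.

    m = -2.5 log10(flux / f0);  sigma_m = (2.5 / ln 10) * flux_err / flux.
    Only defined for flux > 0.
    """
    if flux <= 0:
        raise ValueError("Pogson magnitude is undefined for flux <= 0")
    mag = -2.5 * math.log10(flux / f0)
    mag_err = POGSON * flux_err / flux
    return mag, mag_err

def asinh_mag(flux, f0=1.0, b=1e-10):
    """asinh magnitude: stays finite for zero or negative measured flux.

    m = -(2.5 / ln 10) * [asinh((flux / f0) / (2 b)) + ln b]
    """
    x = flux / f0
    return -POGSON * (math.asinh(x / (2.0 * b)) + math.log(b))

bright = pogson_mag(1e-6, 1e-8)   # (value, error) pair, as in item 5
bright_asinh = asinh_mag(1e-6)    # close to the Pogson value when flux >> b
faint_asinh = asinh_mag(0.0)      # still finite at zero flux
```

Keeping the unit, zero point, and magnitude system next to each column is what makes such conversions reproducible across catalogs (item 9).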
Some key astronomy problems
• Some key astronomy problems that can be addressed with data mining techniques:
  • Cross-Match objects from different catalogues
  • The distance problem (e.g., Photometric Redshift estimators)
  • Star-Galaxy Separation
  • Cosmic-Ray Detection in images
  • Supernova Detection and Classification
  • Morphological Classification (galaxies, AGN, gravitational lenses, ...)
  • Class and Subclass Discovery (brown dwarfs, methane dwarfs, ...)
  • Dimension Reduction = Correlation Discovery
  • Learning Rules for improved classifiers
  • Classification of massive data streams
  • Real-time Classification of Astronomical Events
  • Clustering of massive data collections
  • Novelty, Anomaly, Outlier Detection in massive databases
Classification Methods: Decision Trees, Neural Networks, SVM (Support Vector Machines)
There are 2 Classes! How do you ...
- Separate them?
- Distinguish them?
- Learn the rules?
- Classify them?
Apply Kernel (SVM)
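The idea behind "Apply Kernel" can be sketched without a full SVM. In the made-up example below, one class sits inside the other, so no straight line in (x, y) separates them; after mapping each point to an extra feature r^2 = x^2 + y^2, a single threshold does. The data, the feature map, and the threshold are all illustrative assumptions.

```python
import math

def ring(radius, n):
    """n points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

# Synthetic, made-up data: class 0 sits inside class 1.
inner = ring(1.0, 12)   # class 0
outer = ring(3.0, 12)   # class 1

def feature_map(point):
    """Map (x, y) to (x, y, x^2 + y^2); the third coordinate does the work."""
    x, y = point
    return (x, y, x * x + y * y)

def classify(point, threshold=4.0):
    """Linear rule in the mapped space: a plane at r^2 = threshold."""
    return 1 if feature_map(point)[2] > threshold else 0

predictions_inner = [classify(p) for p in inner]
predictions_outer = [classify(p) for p in outer]
```

A real SVM goes further: it searches for the maximum-margin boundary and uses a kernel function so that the mapped coordinates never have to be computed explicitly.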
Decision Tree Classification Example: SKICAT
Star-Galaxy Discrimination
Reference: ftp://iraf.noao.edu/iraf/conf/web/adass_proc/adass_95/yooj/yooj.html
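A decision tree is built from repeated threshold splits on measured attributes. The sketch below shows only that single building block: picking the threshold on one attribute that misclassifies the fewest training objects. It is not the SKICAT procedure; the `size` attribute and its values are invented for illustration.

```python
def best_split(values, labels):
    """Return (threshold, errors) for the rule 'value > threshold => galaxy'.

    Candidate thresholds are midpoints between consecutive sorted values.
    """
    pairs = sorted(zip(values, labels))
    best = (None, len(pairs) + 1)
    for i in range(len(pairs) - 1):
        threshold = 0.5 * (pairs[i][0] + pairs[i + 1][0])
        errors = sum(
            1 for v, lab in pairs
            if (v > threshold) != (lab == "galaxy")
        )
        if errors < best[1]:
            best = (threshold, errors)
    return best

# Hypothetical training set: one made-up size attribute per object.
size = [1.0, 1.1, 1.2, 1.3, 2.4, 2.6, 3.0, 3.5]
kind = ["star", "star", "star", "star", "galaxy", "galaxy", "galaxy", "galaxy"]

threshold, errors = best_split(size, kind)

def classify(value):
    """Apply the learned one-level rule to a new object."""
    return "galaxy" if value > threshold else "star"
```

A full tree applies the same search recursively to each branch and across many attributes, then prunes to avoid overfitting.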
Decision Tree Classification Example:
Classification of candidates for new supernovae in galaxies
Reference: http://spiff.rit.edu/richmond/sdss/sn_survey/scan_manual/sn_scan.html
Clustering is used to discover the different
unique groupings (classes) of attribute values.
The case shown below is not obvious: one or two groups?
This case is easier: there are two groups.
(in fact, this is the same set of data elements as shown on the
previous slide, but plotted here using a different attribute.)
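The point of these two slides, that the same objects can look like one group in one attribute and two groups in another, can be reproduced with a few lines of k-means. The sketch below runs a 1-D, two-cluster k-means on each attribute separately and reports how far apart the two centers are relative to the scatter. The objects and the `separation` score are illustrative assumptions.

```python
def two_means_1d(values, iterations=50):
    """Plain 1-D k-means with k=2, started from the min and max values."""
    c0, c1 = min(values), max(values)
    for _ in range(iterations):
        group0 = [v for v in values if abs(v - c0) <= abs(v - c1)]
        group1 = [v for v in values if abs(v - c0) > abs(v - c1)]
        if not group1:          # all values identical: one group only
            break
        c0 = sum(group0) / len(group0)
        c1 = sum(group1) / len(group1)
    return c0, c1

def separation(values):
    """Distance between the two centers in units of the within-group scatter."""
    c0, c1 = two_means_1d(values)
    scatter = (sum(min((v - c0) ** 2, (v - c1) ** 2) for v in values)
               / len(values)) ** 0.5
    return abs(c1 - c0) / scatter if scatter > 0 else float("inf")

# Made-up objects, each measured in two attributes (a, b).
objects = [(4.8, 1.0), (5.1, 1.2), (5.0, 0.9), (5.3, 1.1),
           (4.9, 6.0), (5.2, 6.2), (5.0, 5.9), (4.7, 6.1)]

sep_a = separation([a for a, _ in objects])  # groups blur together
sep_b = separation([b for _, b in objects])  # groups stand apart
```

k-means will always return two centers when asked for two, so a score like this (or a comparison against a one-cluster fit) is needed before claiming that two groups are real.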
Clustering in multiple dimensions: colors
combined from SDSS & 2MASS magnitudes
Clustering: Class Discovery and Rule Learning
• Reference: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf
• Clusters and the separation of classes depend on which attributes (dimensions) are chosen to be projected, as in the following star-galaxy discrimination test:
(Figure panels: “Not good” and “Good”.)
Semisupervised Learning:
Outlier Detection
• Reference: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf
A demonstration of a generic machine-assisted discovery problem: data mapping and a search for outliers.
This schematic illustration is of the clustering problem in a parameter space given by three object attributes: P1, P2, and P3.
In this example, most of the data points are assumed to be contained in three dominant clusters (DC1, DC2, and DC3).
However, one may want to discover less populated clusters (e.g., small groups or even isolated points), some of which may be too sparsely populated, or lie too close to one of the major data clouds.
In some cases, negative clusters (holes) may exist in one of the major data clusters.
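A minimal sketch of the outlier search described above, assuming the dominant clusters have already been found: each point is scored by its distance to the nearest cluster center, in units of that cluster's radius, and points beyond a chosen cut are flagged. The centers, radii, points, and the `n_sigma` cut are all hypothetical.

```python
import math

def distance(p, q):
    """Euclidean distance between two points in parameter space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def find_outliers(points, centers, radii, n_sigma=3.0):
    """Return points farther than n_sigma * radius from every cluster center."""
    outliers = []
    for p in points:
        scaled = min(distance(p, c) / r for c, r in zip(centers, radii))
        if scaled > n_sigma:
            outliers.append(p)
    return outliers

# Hypothetical dominant clusters DC1..DC3 in (P1, P2, P3), with rms radii.
centers = [(0.0, 0.0, 0.0), (5.0, 5.0, 0.0), (0.0, 5.0, 5.0)]
radii = [0.5, 0.5, 0.5]

points = [(0.1, -0.2, 0.0), (5.2, 4.9, 0.1), (0.0, 5.1, 4.8),
          (2.5, 2.5, 2.5),      # isolated point between the clouds
          (0.3, 0.4, -0.1)]

outliers = find_outliers(points, centers, radii)
```

This simple cut finds isolated points far from every cloud. It does not address the harder cases named above: small groups lying close to a major cloud, or holes inside one.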
Outlier Detection: Serendipitous Discovery
of Rare or New Objects & Events
Principal Components Analysis &
Independent Components Analysis
Cepheid Variables:
Cosmic Yardsticks
-- One Correlation
-- Two Classes!
... Class Discovery!
Example: SOM (Self-Organizing Map)
• The SOM (Self-Organizing Map) is one technique for organizing information in a database based upon links between concepts.
• It can be used to find hidden relationships and patterns in more complex data collections, usually based on links between keywords or metadata.
Astronomy Data Mining in Action: Exploring the Time Domain
Mega-Flares on normal Sun-like stars = a star like our Sun increased in brightness 300X one night! … say what??
Example: The Thinking Telescope
Sample Data Mining Applications (credit: http://www.thinkingtelescopes.lanl.gov/):
• Automated Feature Extraction: Real-time identification of artifacts and transients in direct and difference images.
• Classifiers: Automated classification of celestial objects based on temporal and spectral properties.
• Anomaly Detection: Real-time recognition of important deviations from normal behavior for persistent sources.
From Sensors to Sense
From Data to Knowledge: from sensors to sense
Data → Information → Knowledge
VOEventNet: a Rapid-Response Telescope Grid
(Diagram labels: GRB satellites; Palomar-Quest; PQ next-day pipelines; baseline sky; catalog; PQ Event Factory; Event Synthesis Engine; VOEvent database; VOEventNet; Raptor; Palomar 60”; Pairitel; eStar; known variables; known asteroids; remote archives; 2MASS; SDSS.)
Reference: http://voeventnet.caltech.edu/
Learning From Archived Temporal Data (Time Series):
Classify New Data (Bayes Analysis or Markov Modeling)
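One way to read the Markov-modeling option on this slide: learn a transition matrix per class from archived, discretized light curves, then assign a new light curve to the class whose matrix gives it the highest likelihood. The sketch below assumes a first-order chain, two brightness states, and invented class names and sequences.

```python
import math

def transition_matrix(sequences, n_states, pseudo=1.0):
    """Estimate first-order Markov transition probabilities with smoothing."""
    counts = [[pseudo] * n_states for _ in range(n_states)]
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1.0
    return [[c / sum(row) for c in row] for row in counts]

def log_likelihood(seq, matrix):
    """Log-probability of the observed transitions under one class model."""
    return sum(math.log(matrix[a][b]) for a, b in zip(seq, seq[1:]))

def classify(seq, models):
    """Pick the class whose transition matrix best explains the sequence."""
    return max(models, key=lambda name: log_likelihood(seq, models[name]))

# Made-up archived light curves, binned to 0 = faint, 1 = bright.
archive = {
    "slow":    [[0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0],
                [1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]],
    "flicker": [[0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0],
                [1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1]],
}
models = {name: transition_matrix(seqs, 2) for name, seqs in archive.items()}

new_curve = [0, 1, 0, 1, 0, 1, 0, 1]
label = classify(new_curve, models)
```

The pseudo-count keeps every transition probability above zero, so a transition never seen in the archive lowers the likelihood instead of making it undefined. Adding log class priors to each score would turn this into a simple Bayes classifier.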
Photometric-Redshift Estimation
Photometric vs. Spectroscopic Redshift Estimates:
• Left panel: standard technique
• Right panel: Machine Learning (data mining) application
• Reference: http://arxiv.org/abs/0710.4482
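To make the machine-learning approach concrete, here is a minimal instance-based sketch: the photometric redshift of a new galaxy is the average spectroscopic redshift of its k nearest neighbors in color space. This is a generic nearest-neighbor regressor, not necessarily the method of the reference above; the training colors and redshifts are invented.

```python
import math

def knn_redshift(colors, training, k=3):
    """Average the spectroscopic redshifts of the k nearest training objects.

    `training` is a list of (color_vector, spectroscopic_redshift) pairs.
    """
    nearest = sorted(training, key=lambda item: math.dist(colors, item[0]))[:k]
    return sum(z for _, z in nearest) / len(nearest)

# Invented training set: two colors per galaxy and a spectroscopic redshift.
training = [
    ((0.4, 0.2), 0.05), ((0.5, 0.3), 0.08), ((0.6, 0.3), 0.10),
    ((0.9, 0.5), 0.20), ((1.0, 0.6), 0.22), ((1.1, 0.6), 0.25),
    ((1.5, 0.9), 0.40), ((1.6, 1.0), 0.43), ((1.7, 1.0), 0.45),
]

z_phot = knn_redshift((1.0, 0.55), training, k=3)
```

The spread of the neighbors' redshifts gives a rough error estimate for free, which matters given that scientific values should travel with their errors.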
Star-Galaxy Separation in
Clustered Feature Space
* = star
• = galaxy
http://arxiv.org/abs/astro-ph/9508012
Bayesian Probabilistic Estimation
for Catalog Cross-Matching
• Reference: http://arxiv.org/abs/astro-ph/0605216
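A hedged sketch of probabilistic cross-matching for two catalogs: compute the angular separation between a pair of detections, then a Bayes factor comparing "same source" against "unrelated sources". The Bayes factor below follows the two-catalog, small-angle Gaussian form commonly quoted from Budavari & Szalay 2008 (see Suggested Reading); treat that as an assumption, not as the method of the reference cited on this slide. Positions and astrometric errors are made up.

```python
import math

ARCSEC = math.radians(1.0 / 3600.0)  # one arcsecond in radians

def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in radians (haversine form); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(h))

def bayes_factor(psi, sigma1, sigma2):
    """Evidence ratio 'same source' vs 'unrelated sources' for two detections.

    psi, sigma1, sigma2 in radians; assumes Gaussian errors and small angles.
    """
    s2 = sigma1 ** 2 + sigma2 ** 2
    return (2.0 / s2) * math.exp(-psi ** 2 / (2.0 * s2))

# Made-up positions: an optical source and two infrared candidates.
optical = (150.00000, 2.00000)
candidates = {"near": (150.00010, 2.00005), "far": (150.00500, 2.00300)}
sigma_opt, sigma_ir = 0.1 * ARCSEC, 0.3 * ARCSEC

factors = {
    name: bayes_factor(angular_separation(*optical, *pos), sigma_opt, sigma_ir)
    for name, pos in candidates.items()
}
best = max(factors, key=factors.get)
```

A Bayes factor well above 1 favors a match and one well below 1 disfavors it; turning it into a posterior probability additionally requires a prior based on the source densities of the two catalogs.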
Fundamental Plane for 156,000 cross-matched Sloan+2MASS Elliptical
Galaxies: plot shows variance captured by first 2 Principal Components
as a function of local galaxy density.
Reference: Borne, Dutta, Giannella, Kargupta, & Griffin 2008
(Figure axes: % of variance captured by PC1+PC2, plotted against local galaxy density from low to high.)
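The quantity plotted on this slide, the fraction of variance captured by the first two principal components, can be computed as sketched below. This uses synthetic three-parameter data lying near a plane, not the Sloan+2MASS sample, and a plain power iteration with deflation so that no external library is needed; in practice a linear algebra package would normally do this step.

```python
import math
import random

def covariance(rows):
    """Sample covariance matrix of rows (each row = one galaxy's parameters)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]

def top_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and its eigenvector by power iteration."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v
        v = [x / norm for x in w]
    value = sum(v[i] * matrix[i][j] * v[j] for i in range(d) for j in range(d))
    return value, v

def variance_in_first_two(rows):
    """Fraction of total variance captured by PC1 + PC2."""
    cov = covariance(rows)
    d = len(cov)
    total = sum(cov[i][i] for i in range(d))   # trace = sum of eigenvalues
    lam1, v1 = top_eigenpair(cov)
    # Deflation: remove the first component, then find the next largest.
    deflated = [[cov[i][j] - lam1 * v1[i] * v1[j] for j in range(d)]
                for i in range(d)]
    lam2, _ = top_eigenpair(deflated)
    return (lam1 + lam2) / total

# Synthetic sample: three parameters lying close to the plane z = x + 2y.
rng = random.Random(0)
sample = []
for _ in range(300):
    x, y = rng.gauss(0, 1), rng.gauss(0, 1)
    sample.append((x, y, x + 2 * y + rng.gauss(0, 0.05)))

fraction = variance_in_first_two(sample)
```

A fraction close to 1 means the three parameters are well described by a plane. Repeating the calculation in bins of local galaxy density would give one point per bin, as in the plot.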
Suggested Reading:
Data Mining in Astronomy
• Djorgovski et al. 2000, Searches for Rare and New Types of Objects. http://arxiv.org/abs/astro-ph/0012453
• Djorgovski et al. 2000, Exploration of Large Digital Sky Surveys. http://arxiv.org/abs/astro-ph/0012489
• Djorgovski et al. 2001, Exploration of Parameter Spaces in a Virtual Observatory. http://arxiv.org/abs/astro-ph/0108346
• Mining the Sky, 2001, published proceedings of ESO conference.
• Suchkov et al. 2003, Automated Object Classification with ClassX. astro-ph/0210407
• Suchkov, Hanisch, & Margon 2005, A Census of Object Types and Redshift Estimates in the SDSS Photometric Catalog from a Trained Decision Tree Classifier. http://adsabs.harvard.edu/abs/2005AJ....130.2439S
• Giannella et al. 2006, Distributed Data Mining for Astronomy Catalogs. http://www.cs.umbc.edu/~hillol/PUBS/Papers/Astro.pdf
• Rohde et al. 2006, Matching of Catalogues by Probabilistic Pattern Classification. http://adsabs.harvard.edu/abs/2006MNRAS.369....2R
• Budavari & Szalay 2008, Probabilistic Cross-Identification of Astronomical Sources. http://adsabs.harvard.edu/abs/2008ApJ...679..301B
Suggested Reading, continued:
Data Mining in Astronomy
• Odewahn et al. 1993, Star-Galaxy Separation with a Neural Network. 2: Multiple Schmidt Plate Fields. http://adsabs.harvard.edu/abs/1993PASP..105.1354O
• Borne 2000, Science User Scenarios for a Virtual Observatory Design Reference Mission: Science Requirements for Data Mining. astro-ph/0008307
• Brunner et al. 2001, Massive Datasets in Astronomy. astro-ph/0106481
• Gray et al. 2002, Data Mining the SDSS SkyServer Database. http://arxiv.org/abs/cs/0202014
• Odewahn et al. 2004, The Digitized Second Palomar Observatory Sky Survey (DPOSS). III. Star-Galaxy Separation. http://adsabs.harvard.edu/abs/2004AJ....128.3092O
• Ball, Brunner, et al. 2006, Robust Machine Learning Applied to Astronomical Data Sets. I. Star-Galaxy Classification of the Sloan Digital Sky Survey DR3 Using Decision Trees. http://adsabs.harvard.edu/abs/2006ApJ...650..497B
• Ball, Brunner, et al. 2007, Robust Machine Learning Applied to Astronomical Data Sets. II. Quantifying Photometric Redshifts for Quasars Using Instance-based Learning. http://adsabs.harvard.edu/abs/2007ApJ...663..774B
• Ball, Brunner, et al. 2008, Robust Machine Learning Applied to Astronomical Data Sets. III. Probabilistic Photometric Redshifts for Galaxies and Quasars in the SDSS and GALEX. http://adsabs.harvard.edu/abs/2008ApJ...683...12B
Suggested Reading, continued:
Data Mining in Astronomy
• Rogers & Riess 1994, Detection and Classification of CCD Defects with an Artificial Neural Network. http://adsabs.harvard.edu/abs/1994PASP..106..532R
• Feeney et al. 2005, Automated Detection of Classical Novae with Neural Networks. http://adsabs.harvard.edu/abs/2005AJ....130...84F
• Wadadekar 2005, Estimating Photometric Redshifts Using Support Vector Machines. http://adsabs.harvard.edu/abs/2005PASP..117...79W
• Bazell & Miller 2005, Class Discovery in Galaxy Classification. http://adsabs.harvard.edu/abs/2005ApJ...618..723B
• Bazell, Miller, & SubbaRao 2006, Objective Subclass Determination of Sloan Digital Sky Survey Spectroscopically Unclassified Objects. http://adsabs.harvard.edu/abs/2006ApJ...649..678B
• Ferreras et al. 2006, A Principal Component Analysis approach to the Star Formation History of Elliptical Galaxies in Compact Groups. http://adsabs.harvard.edu/abs/2006MNRAS.370..828F
• Way & Srivastava 2006, Novel Methods for Predicting Photometric Redshifts from Broadband Photometry Using Virtual Sensors. http://adsabs.harvard.edu/abs/2006ApJ...647..102W
• Carliles et al. 2007, Photometric Redshift Estimation on SDSS Data Using Random Forests. http://arxiv.org/abs/0711.2477
Some Data Mining Software & Projects
• General data mining software packages:
  – Weka (Java): http://www.cs.waikato.ac.nz/ml/weka/
  – Weka4WS (Grid-enabled): http://grid.deis.unical.it/weka4ws/
  – RapidMiner: http://www.rapidminer.com/
• Astronomy-specific software and/or user clients:
  – VO-Neural: http://voneural.na.infn.it/
  – AstroWeka: http://astroweka.sourceforge.net/
  – OpenSkyQuery: http://www.openskyquery.net/
  – ALADIN: http://aladin.u-strasbg.fr/
  – MIRAGE: http://cm.bell-labs.com/who/tkh/mirage/
  – AstroBox: http://services.china-vo.org/
• Astronomical and/or Scientific Data Mining Projects:
  – GRIST: http://grist.caltech.edu/
  – ClassX: http://heasarc.gsfc.nasa.gov/classx/
  – LCDM: http://dposs.ncsa.uiuc.edu/
  – F-MASS: http://www.itsc.uah.edu/f-mass/
  – NCDM: http://www.ncdm.uic.edu/
Weka:
http://www.cs.waikato.ac.nz/ml/weka/
• Weka is in your NVOSS software distribution.
• Weka is a collection of open source machine learning algorithms for data mining tasks.
• Weka algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka comes with its own GUI.
• Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.
AstroWeka:
http://astroweka.sourceforge.net/
http://www.iterating.com/products/Weka
http://weka.sourceforge.net/wekadoc/index.php/en:Knowledge_Flow_%283.4.10%29
ALADIN: http://aladin.u-strasbg.fr/
MIRAGE:
http://cm.bell-labs.com/who/tkh/mirage/
Java Package for exploratory data analysis (EDA),
correlation mining, and interactive pattern discovery.
Science is Knowledge Work
Data → Information → Knowledge
• Knowledge Discovery is the central theme of
science.
• Knowledge Discovery in Databases (KDD) is the
killer app for large scientific databases.
• Therefore, KDD (i.e., Data Mining) is an essential
tool, since “big-data” science is here to stay (at
petabytes and beyond).
Scientific Knowledge Discovery
Heliophysics
Space Weather Example
Sun-Earth Space Environment –
Rich Source of Heliophysical Phenomena
Multi-point Observations and Models of Space
Plasmas Deliver a Deluge of Physical Measurements
Heliophysics Space Weather Example
CME = Coronal Mass Ejection
SEP = Solar Energetic Particle
Data Mining: It is more than just connecting the dots
Reference: http://homepage.interaccess.com/~purcellm/lcas/Cartoons/cartoons.htm
Sample Astronomy Data Mining Application
Ideas for your Projects
– Neural Network for Pixel Classification: Event Detection and
Prediction (e.g., Supernova or Cosmic-ray hit?)
– Bayesian Network for Object Classification (star or galaxy?)
– PCA for finding Fundamental Planes of Galaxy Parameters
– PCA (weakest component) for Outlier Detection: anomalies,
novel discoveries, new objects
– Link Analysis (Association Mining) for Causal Event Detection
(e.g., linking optical transients with gamma-ray events)
– Clustering analysis: Spatial, Temporal, or any scientific
database parameters
– Markov models: Temporal mining, classification, and
prediction from time series data