Computational AstroStatistics Bob Nichol (Carnegie Mellon)

Motivation & Goals
Multi-Resolutional KD-trees (examples)
Npt functions (application)
Mixture models (applications)
Bayes network anomaly detection (application)
Very high dimensional data
NVO Problems
Collaborators
Pittsburgh Computational AstroStatistics (PiCA) Group
Chris Miller, Percy Gomez, Kathy Romer, Andy Connolly, Andrew Hopkins, Mariangela Bernardi, Tomo Goto (Astro)
Larry Wasserman, Chris Genovese, Woncheol Jang, Pierpaolo Brutti (Statistics)
Andrew Moore, Jeff Schneider, Brigham Anderson, Alex Gray, Dan Pelleg (CS)
Alex Szalay, Gordon Richards, Istvan Szapudi & others (SDSS)
(See http://www.picagroup.org)
First Motivation
Cosmology is moving from a “discovery”
science into a “statistical” science
Drive for "high precision" measurements:
Cosmological parameters to a few percent;
Accurate description of the complex structure in
the universe;
Control of observational and sampling biases
New statistical tools (e.g. non-parametric analyses) are often computationally intensive. Also, we often want to re-sample or Monte Carlo the data.
Second Motivation
Last decade was dedicated to building more
telescopes and instruments; more coming this
decade as well (SDSS, Planck, LSST, 2MASS,
DPOSS, MAP). Also, larger simulations.
We have a "Data Flood": SDSS produces terabytes of data a night, while LSST will be an SDSS every 5 nights! Petabytes by the end of the '00s
Highly correlated datasets and high dimensionality
Existing statistics and algorithms do not scale into
these regimes
New Paradigm where we must build new
tools before we can analyze & visualize data
SDSS
SDSS Data
Quantity     SDSS             Factor
Area         10,000 sq deg    3
Objects      2.5 billion      200
Spectra      1.5 million      200
Depth        R = 23           10
Attributes   144 presently    10
FACTOR OF 12,000,000 (= 3 x 200 x 200 x 10 x 10)
SDSS Science

Most Distant Object!
 100,000 spectra!
Goal to build new, fast &
efficient statistical algorithms
Start with tree data structures:
Multi-resolutional kd-trees
Scale to n-dimensions (although for very high
dimensions use new tree structures)
Use Cached Representation (store at each node
summary sufficient statistics). Compute counts
from these statistics
Prune the tree which is stored in memory!
See Moore et al. 2001 (astro-ph/0012333)
Many applications; suite of algorithms!
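A minimal Python sketch of such a tree (illustrative only, not the actual research C code referred to later): each node caches its point count, bounding box and vector sum as sufficient statistics, and only the leaves hold points.
import numpy as np

class KDNode:
    """Node of a multi-resolutional kd-tree with cached sufficient statistics."""

    def __init__(self, points, leaf_size=16):
        points = np.asarray(points, dtype=float)
        # Cached ("sufficient") statistics for this node.
        self.lo = points.min(axis=0)        # bounding-box lower corner
        self.hi = points.max(axis=0)        # bounding-box upper corner
        self.count = len(points)            # number of points below this node
        self.sum = points.sum(axis=0)       # vector sum, e.g. for centroids in EM
        self.left = self.right = None
        self.points = None
        if self.count <= leaf_size:
            self.points = points            # store the points only at the leaves
            return
        # Split on the widest dimension at its midpoint.
        dim = int(np.argmax(self.hi - self.lo))
        mid = 0.5 * (self.lo[dim] + self.hi[dim])
        mask = points[:, dim] <= mid
        if mask.all() or (~mask).all():     # degenerate split: stop here
            self.points = points
            return
        self.left = KDNode(points[mask], leaf_size)
        self.right = KDNode(points[~mask], leaf_size)

# Example: tree = KDNode(np.random.default_rng(0).random((100000, 2)))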
Range Searches
Fast range searches and catalog matching
Prune cells entirely outside the range; also prune cells entirely inside, whose cached counts are added wholesale, for an even greater saving in time
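A sketch of this pruned range count, assuming the KDNode structure from the sketch above (the distance bounds dmin/dmax to the query centre come from the cached bounding box):
import numpy as np

def range_count(node, centre, r):
    """Count points within distance r of `centre`, pruning whole cells."""
    centre = np.asarray(centre, dtype=float)
    # Closest and farthest possible distances from `centre` to the node's box.
    nearest = np.clip(centre, node.lo, node.hi)
    dmin = np.linalg.norm(centre - nearest)
    dmax = np.linalg.norm(np.maximum(node.hi - centre, centre - node.lo))
    if dmin > r:                 # cell entirely outside the range: prune
        return 0
    if dmax <= r:                # cell entirely inside: use the cached count
        return node.count
    if node.points is not None:  # leaf: fall back to direct distances
        d = np.linalg.norm(node.points - centre, axis=1)
        return int((d <= r).sum())
    return range_count(node.left, centre, r) + range_count(node.right, centre, r)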
N-point correlation functions
The 2-point function has a long history in cosmology (Peebles 1980). It is the excess joint probability of a pair of points over that expected from a Poisson process:
dP12 = n^2 [1 + ξ(r12)] dV1 dV2, where n is the mean number density.
There is also a long history (as point processes) in Statistics.
Similarly, the three-point function ζ is defined via
dP123 = n^3 [1 + ξ(r12) + ξ(r23) + ξ(r31) + ζ(r12, r23, r31)] dV1 dV2 dV3, and so on for higher orders.
Same 2pt, very different 3pt
Naively, computing the n-point function of N data points is an O(N^n) process, but all it really is, is a set of range searches.
Dual Tree Approach
Usually binned into annuli rmin < r < rmax. Thus, for each r traverse both trees and prune pairs of nodes with either dmax < rmin or dmin > rmax (no pair can fall in the annulus). Also, if dmin > rmin & dmax < rmax, all pairs in these nodes are within the annulus and the cached counts are multiplied. Therefore, we only need to explicitly calculate pairs in nodes cutting the annulus boundaries.
Extra speed-ups are possible by doing multiple r's together and by controlled approximations
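A sketch of this exclude/subsume logic for a single annulus, again assuming the KDNode structure sketched earlier (a full implementation would also handle both nodes coming from the same tree and would process several radii per traversal):
import numpy as np

def node_distances(a, b):
    """Min and max possible distances between points in nodes a and b."""
    gap = np.maximum(0.0, np.maximum(a.lo - b.hi, b.lo - a.hi))  # per-dim separation
    span = np.maximum(a.hi - b.lo, b.hi - a.lo)                  # per-dim farthest extent
    return np.linalg.norm(gap), np.linalg.norm(span)

def pair_count(a, b, rmin, rmax):
    """Count pairs (one point from a, one from b) with rmin < r <= rmax."""
    dmin, dmax = node_distances(a, b)
    if dmax <= rmin or dmin > rmax:          # exclude: no pair can fall in the annulus
        return 0
    if dmin > rmin and dmax <= rmax:         # subsume: every pair falls in the annulus
        return a.count * b.count
    if a.points is not None and b.points is not None:   # two leaves: count directly
        d = np.linalg.norm(a.points[:, None, :] - b.points[None, :, :], axis=-1)
        return int(((d > rmin) & (d <= rmax)).sum())
    # Otherwise recurse on the children of the larger (internal) node.
    if a.points is None and (b.points is not None or a.count >= b.count):
        return pair_count(a.left, b, rmin, rmax) + pair_count(a.right, b, rmin, rmax)
    return pair_count(a, b.left, rmin, rmax) + pair_count(a, b.right, rmin, rmax)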
[Timing figures: naive N*N (2-point) and N*N*N (3-point) scaling versus the ~N log N tree-based counts; run time depends on the density of points, the bin size and the scale]
Fast Mixture Models
Describe the data in N dimensions as a mixture of, say, K Gaussians (the kernel shape matters less than the bandwidth!)
The parameters of the model are then the K Gaussians, each with a weight, mean and covariance
Iterate, testing with BIC and AIC at each iteration. Fast because of kd-trees (20 mins for 100,000 points on a PC!)
Employ a heuristic splitting algorithm as well
Details in Connolly et al. 2000 (astro-ph/0008187)
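A minimal sketch of the same idea using scikit-learn's EM-based GaussianMixture rather than the tree-accelerated code described above; the number of Gaussians is chosen by BIC and the fitted model doubles as a low-probability anomaly detector (the toy data are invented):
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy stand-in for a colour-colour catalogue: two blobs plus a few outliers.
X = np.vstack([
    rng.normal([0, 0], 0.3, size=(5000, 2)),
    rng.normal([2, 1], 0.2, size=(3000, 2)),
    rng.uniform(-3, 5, size=(20, 2)),
])

# Fit mixtures of increasing size and keep the one with the lowest BIC.
fits = [GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 8)]
best = min(fits, key=lambda gm: gm.bic(X))
print("BIC-selected number of Gaussians:", best.n_components)

# Anomaly detection: flag the lowest-likelihood points under the fitted model.
log_like = best.score_samples(X)
anomalies = np.argsort(log_like)[:20]
print("indices of the 20 least likely objects:", anomalies)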
EM-Based Gaussian Mixture Clustering with 1, 2, 4 and 20 Gaussians [sequence of figures]
Applications
Used in SDSS quasar target selection to map the multi-color stellar locus (Gordon Richards @ PSU)
Anomaly detector (look for low probability
points in N-dimensions)
Optimal smoothing of large-scale structure
SDSS QSO target selection in 4D color-space
Cluster 9999 spectroscopically confirmed stars
Cluster 8833 spectroscopically confirmed QSOs (33 Gaussians)
99% for stars, 96% for QSOs
Bayes Net Anomaly Detector
Instead of using a single joint probability function (fitted to the data), factorize it into a smaller set of conditional probabilities
Directed and acyclic (a DAG)
If we know the graph and the conditional probabilities, we have a valid probability function for the whole model
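A toy sketch of this factorization for three discrete variables on the DAG a -> b -> c, so that P(a, b, c) = P(a) P(b|a) P(c|b); the conditional tables are estimated from counts and the lowest-likelihood records are flagged (the variables and structure are invented for illustration):
import numpy as np

rng = np.random.default_rng(1)
# Toy discrete data with DAG structure a -> b -> c (each variable takes values 0, 1, 2).
a = rng.integers(0, 3, size=10000)
b = (a + rng.integers(0, 2, size=a.size)) % 3
c = (b + rng.integers(0, 2, size=a.size)) % 3
data = np.column_stack([a, b, c])

def cpt(child, parent):
    """Estimate P(child | parent) from counts, with Laplace smoothing."""
    table = np.ones((3, 3))
    for p, ch in zip(parent, child):
        table[p, ch] += 1
    return table / table.sum(axis=1, keepdims=True)

p_a = (np.bincount(a, minlength=3) + 1) / (a.size + 3)   # P(a)
p_b_a = cpt(b, a)                                         # P(b | a)
p_c_b = cpt(c, b)                                         # P(c | b)

# Log-likelihood of each record under the factorized model.
log_like = np.log(p_a[a]) + np.log(p_b_a[a, b]) + np.log(p_c_b[b, c])
anomalies = np.argsort(log_like)[:10]
print("10 most anomalous records:\n", data[anomalies])
# The factor with the smallest probability tells you *why* a record is anomalous.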
Use 1.5 million SDSS
sources to learn model (25
variables each)
Then evaluate the likelihood of each object being drawn from the model
The lowest 1000 are anomalous; look at 'em and follow 'em up at Keck
Unfortunately, many of these turn out to be errors
The advantage of the Bayes Net is that it tells you why an object was anomalous: the most unusual conditional probabilities
Therefore, iterate the loop and get the scientist to highlight obvious errors; then suppress those errors so they do not return again
Issue of productivity!
Will Only Get Worse
LSST will do an SDSS every 5 nights
looking for transient objects producing
petabytes of data (2007)
VISTA will collect 300 Terabytes of data
(2005)
Archival Science is upon us! HST
database has 20GBytes per day
downloaded (10 times more than goes in!)
Will Only Get Worse II
Surveys spanning the electromagnetic spectrum
Combining these surveys is
hard: different sensitivities,
resolutions and physics
Mixture of imaging, catalogs
and spectra
Difference between
continuum and point
processes
Thousands of attributes per
source
What is VO?
The “Virtual Observatory” must:
 Federate multi-wavelength data sources
(interoperability)
 Empower everyone (democratise)
 Be fast, distributed and easy
 Allow input and output
Computer Science + Statistics!
Scientists will need help through autonomous
scientific discovery of large, multi-dimensional,
correlated datasets
Scientists will need fast databases
Scientists will need distributed computing and fast
networks
Scientists will need new visualization tools
CS and Statistics are looking for new challenges; astronomy also has no data-rights & privacy issues
New breed of students needed with IT skills
Symbiotic Relationship
VO Prototype
Ideally we would like all parts of the VO to be web-services
[Prototype architecture diagram: DB and C#/.NET web service communicating over http, exchanging dym files with the EM client]
Lessons We Learnt
Tough to marry research C code developed under Linux to MS (pointers to memory)
.NET has “unsafe” memory
.NET server is hard to set up!
Migrate to using VOTables to perform all I/O.
Have server running at CMU so we can control code
Very High Dimensions
Using LLE and Isomap; looking for lower-dimensional manifolds in higher-dimensional spaces
500x2000 space from SDSS spectra
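A minimal sketch using scikit-learn's Isomap and LLE on a stand-in spectra matrix (rows as spectra, columns as wavelength bins; the data below are synthetic, not the SDSS spectra shown in the figure):
import numpy as np
from sklearn.manifold import Isomap, LocallyLinearEmbedding

rng = np.random.default_rng(0)
# Stand-in spectra matrix: 500 "spectra" x 2000 "wavelength bins",
# generated so that the rows actually live on a 2-D manifold.
t = rng.uniform(0, 1, size=(500, 2))
basis = rng.normal(size=(2, 2000))
spectra = np.tanh(t @ basis) + 0.01 * rng.normal(size=(500, 2000))

iso = Isomap(n_neighbors=10, n_components=2).fit_transform(spectra)
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(spectra)
print("Isomap embedding shape:", iso.shape)    # (500, 2)
print("LLE embedding shape:", lle.shape)       # (500, 2)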
Summary
Era of New Cosmology: Massive data sources and
search for subtle features & high precision
measurements
Need new methods that scale into these new regimes; "a virtual universe" (students will need different skills). Perfect synergy with Stats, CS, Physics
Good algorithms are worth as much as faster and more computers!
The "glue" to make a "virtual observatory" is hard and complex. Don't underestimate the job
Are the Features Real? (FDR)!
This is an example of multiple hypothesis testing, e.g. is every point consistent with a smooth P(k)?
Let us first look at a simulated example: consider a 1000x1000 pixel image containing 40,000 sources.
FDR
Method        True detections   False detections   Missed sources   Correct rejections
FDR           30,389            1,505              9,611            958,495
2-sigma       31,497            22,728             8,503            937,272
Bonferroni    27,137            0                  12,863           960,000
FDR makes 15 times fewer false detections for essentially the same power as the traditional 2-sigma threshold
Why? It controls a scientifically meaningful quantity:
FDR = no. of false discoveries / total no. of discoveries
And it is adaptive to the size of the dataset
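A sketch of the Benjamini-Hochberg procedure that controls the FDR at a chosen level alpha: sort the p-values, find the largest k with p_(k) <= (k/m) alpha and flag everything up to it as a discovery (alpha = 0.25 as quoted below; the toy p-values are invented):
import numpy as np

def fdr_discoveries(p_values, alpha=0.25):
    """Benjamini-Hochberg: boolean mask of 'discoveries' at FDR level alpha."""
    p = np.asarray(p_values)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Largest k such that p_(k) <= (k/m) * alpha  (k is 1-based).
    below = ranked <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])      # last index satisfying the bound
        reject[order[: k + 1]] = True         # reject all smaller p-values as well
    return reject

# Toy example: 960,000 null pixels plus 40,000 genuine sources.
rng = np.random.default_rng(0)
p_null = rng.uniform(size=960000)
p_src = rng.uniform(size=40000) ** 8          # genuine sources give small p-values
discoveries = fdr_discoveries(np.concatenate([p_null, p_src]), alpha=0.25)
print("number of discoveries:", discoveries.sum())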
We used an FDR of 0.25, i.e. at most 25% of the circled points are expected to be in error
Therefore, we can say with statistical rigor that at most of these points the smooth P(k) is rejected, and they are thus "features"
No single point is
a 3sigma deviation
New statistics has enabled an astronomical discovery