Petascale Data Science Challenges in Astronomy

(the Borne Identity)
Data Literacy for all!
(the Borne Ultimatum)
Knowledge Discovery from
Mining Big Data
Kirk Borne
@KirkDBorne
George Mason University
School of Physics, Astronomy, & Computational Sciences
http://classweb.gmu.edu/kborne/
The Big Data Manifesto
(the Borne Ultimatum)
• More data is not just more data … more is different!
• Discover the unknown unknowns.
• Address the massive Data-to-Knowledge (D2K) challenge.
• Data Literacy for all!
Ever since we first began to explore our world, humans have asked questions, and we have collected evidence (data) to help answer those questions.
The journey from traditional science to Data-intensive Science is a Big Challenge.
http://www.economist.com/specialreports/displaystory.cfm?story_id=15557443
Scary News: Big Data is taking us to a Tipping Point
http://bit.ly/HUqmu5
http://goo.gl/Aj30t
Promising News: Big Data leads to Big Insights and New Discoveries
http://news.nationalgeographic.com/news/2010/11/photogalleries/101103-nasa-space-shuttle-discovery-firsts-pictures/
Good News: Big Data is Sexy
http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1
http://dilbert.com/strips/comic/2012-09-05/
Characteristics of Big Data – 1234
• Computing power doubles every 18 months (Moore’s Law) …
  • 100x improvement in 10 years.
• The amount of data doubles every year (or faster!) …
  • 1000x in 10 years, and 1,000,000x in 20 years.
• I/O bandwidth increases ~10% per year …
  • <3x improvement in 10 years.
• Moore’s Law of Slacking will not help! (http://arxiv.org/abs/astro-ph/9912202) (see the quick arithmetic check below)

How much data are there in the world?
• From the beginning of recorded time until 2003, we created 5 billion gigabytes (i.e., 5 exabytes) of data.
• In 2011 the same amount was created every two days.
• In 2013, the same amount is created every 10 minutes.
http://money.cnn.com/gallery/technology/2012/09/10/big-data.fortune/index.html
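A quick arithmetic check of those growth factors (my own sketch in Python, not part of the original deck; it only compounds the rates quoted above):

```python
# Compound growth factors over 10 and 20 years (rates as quoted above).
compute_10yr = 2 ** (10 * 12 / 18)   # Moore's Law: doubling every 18 months
data_10yr    = 2 ** 10               # data volume: doubling every year
data_20yr    = 2 ** 20
io_10yr      = 1.10 ** 10            # I/O bandwidth: +10% per year

print(f"Compute, 10 yr: {compute_10yr:9.0f}x")   # ~100x
print(f"Data,    10 yr: {data_10yr:9.0f}x")      # ~1000x
print(f"Data,    20 yr: {data_20yr:9.0f}x")      # ~1,000,000x
print(f"I/O,     10 yr: {io_10yr:9.1f}x")        # ~2.6x (< 3x)
```

The point of the comparison: over a decade, data volume compounds roughly 10x faster than compute and roughly 400x faster than I/O, which is why bandwidth, not storage, becomes the bottleneck.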
Characteristics of Big Data – 1234
• Big quantities of data are acquired everywhere.
• It is now a big issue in all aspects of life: science, business, healthcare, government, social networks, national security, media, etc.
• LSST project (www.lsst.org):
  • 20 Terabytes of astronomical imaging every night
  • 100-200 Petabyte image archive after 10 years
  • 20-40 Petabyte database
  • 2-10 million new sky events nightly that need to be characterized and classified – potential new discoveries! (see the back-of-envelope check below)
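A back-of-envelope check on those LSST numbers (my own arithmetic; it naively assumes 365 observing nights per year and ignores weather, downtime, and reprocessing):

```python
# Naive accumulation of nightly imaging over the 10-year survey (sketch).
tb_per_night = 20
nights = 365 * 10                       # assumption: every night for 10 years

raw_pb = tb_per_night * nights / 1000   # terabytes -> petabytes
print(f"Raw imaging over 10 years: ~{raw_pb:.0f} PB")   # ~73 PB
# Storing multiple processed versions of each image plausibly pushes the
# archive into the quoted 100-200 PB range.
```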
Characteristics of Big Data – 1234
• Job opportunities are sky-rocketing.
• Extremely high demand for Data Science skills.
• Demand will continue to increase.
• Old: “100 applicants per job”. New: “100 jobs per applicant”.
• McKinsey Report (2011**):
  • Big Data is the new “gold rush”, the “new oil”.
  • A shortage of 1.5 million skilled data scientists within 5 years.
  • ** http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation
Data Sciences: A National Imperative
1. National Academies report: Bits of Power: Issues in Global Access to Scientific Data (1997), http://www.nap.edu/catalog.php?record_id=5504
2. NSF (National Science Foundation) report: Knowledge Lost in Information: Research Directions for Digital Libraries (2003), http://www.sis.pitt.edu/~dlwkshop/report.pdf
3. NSF report: Cyberinfrastructure for Environmental Research and Education (2003), http://www.ncar.ucar.edu/cyber/cyberreport.pdf
4. NSB (National Science Board) report: Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century (2005), http://www.nsf.gov/nsb/documents/2005/LLDDC_report.pdf
5. NSF report with the Computing Research Association: Cyberinfrastructure for Education and Learning for the Future: A Vision and Research Agenda (2005), http://www.cra.org/reports/cyberinfrastructure.pdf
6. NSF Atkins Report: Revolutionizing Science & Engineering Through Cyberinfrastructure: Report of the NSF Blue-Ribbon Advisory Panel on Cyberinfrastructure (2005), http://www.nsf.gov/od/oci/reports/atkins.pdf
7. NSF report: The Role of Academic Libraries in the Digital Data Universe (2006), http://www.arl.org/bm~doc/digdatarpt.pdf
8. NSF report: Cyberinfrastructure Vision for 21st Century Discovery (2007), http://www.nsf.gov/od/oci/ci_v5.pdf
9. JISC/NSF Workshop report on Data-Driven Science & Repositories (2007), http://www.sis.pitt.edu/~repwkshop/NSFJISC-report.pdf
10. DOE report: Visualization and Knowledge Discovery: Report from the DOE/ASCR Workshop on Visual Analysis and Data Exploration at Extreme Scale (2007), http://www.sc.doe.gov/ascr/ProgramDocuments/Docs/DOE-Visualization-Report-2007.pdf
11. DOE report: Mathematics for Analysis of Petascale Data Workshop Report (2008), http://www.sc.doe.gov/ascr/ProgramDocuments/Docs/PetascaleDataWorkshopReport.pdf
12. NSTC Interagency Working Group on Digital Data report: Harnessing the Power of Digital Data for Science and Society (2009), http://www.nitrd.gov/about/Harnessing_Power_Web.pdf
13. National Academies report: Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age (2009), http://www.nap.edu/catalog.php?record_id=12615
14. NSF report: Data-Enabled Science in the Mathematical and Physical Sciences (2010), http://www.cra.org/ccc/docs/reports/DES-report_final.pdf
15. National Big Data Research and Development Initiative (2012), http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf
The Fourth Paradigm: Data-Intensive Scientific Discovery
http://research.microsoft.com/en-us/collaboration/fourthparadigm/
The 4 Scientific Paradigms:
1. Experiment (sensors)
2. Theory (modeling)
3. Simulation (HPC)
4. Data Exploration (KDD)
Characteristics of Big Data – 1234
• The emergence of Data Science and Data-Oriented Science (the 4th paradigm of science).
  • “Computational literacy and data literacy are critical for all.” – Kirk Borne
• A complete data collection on any complex domain (e.g., the Earth, the Universe, or the Human Body) has the potential to encode the knowledge of that domain, waiting to be mined and discovered.
  • “Somewhere, something incredible is waiting to be known.” – Carl Sagan
• We call this “X-Informatics”: addressing the D2K (Data-to-Knowledge) Challenge in any discipline X using Data Science.
• Examples: Astroinformatics, Bioinformatics, Geoinformatics, Climate Informatics, Ecological Informatics, Biodiversity Informatics, Environmental Informatics, Health Informatics, Medical Informatics, Neuroinformatics, Crystal Informatics, Cheminformatics, Discovery Informatics, and more.
Characterizing the Big Data Hype
• If the only distinguishing characteristic was that we have lots of data, we would call it “Lots of Data”.
• Big Data characteristics: the 3+n V’s =
1. Volume (lots of data = “Tonnabytes”)
2. Variety (complexity, curse of dimensionality)
3. Velocity (rate of data and information flow)
4. Veracity (verifying inference-based models from comprehensive data collections) … as I said earlier: a complete data collection on any complex domain (e.g., the Earth, the Universe, or the Human Body) has the potential to encode the knowledge of that domain, waiting to be mined and discovered.
5. Variability
6. Venue
7. Vocabulary
8. Value

Big Data Example – 2. Variety: this one helps us to discriminate subtle new classes (= Class Discovery):
Figure: Insufficient Variety – stars & galaxies are not separated in this parameter.
Figure: Sufficient Variety – stars & galaxies are separated in this parameter.
4 Categories of Scientific KDD (Knowledge Discovery in Databases)
• Class Discovery
  – Finding new classes of objects and behaviors
  – Learning the rules that constrain the class boundaries
• Novelty Discovery
  – Finding new, rare, one-in-a-million(billion)(trillion) objects and events
• Correlation Discovery
  – Finding new patterns and dependencies, which reveal new natural laws or new scientific principles
• Association Discovery
  – Finding unusual (improbable) co-occurring associations
This graphic says it all …
• Clustering – examine the data and find the data clusters (clouds), without considering what the items are = Characterization!
• Classification – for each new data item, try to place it within a known class (i.e., a known category or cluster) = Classify!
• Outlier Detection – identify those data items that don’t fit into the known classes or clusters = Surprise! (see the sketch below)
Graphic provided by Professor S. G. Djorgovski, Caltech
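To make the graphic’s three operations concrete, here is a minimal runnable sketch (my illustration, not from the talk) using scikit-learn on toy 2-D data; all sizes and parameters are arbitrary assumptions:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # toy "sky" data

# 1) Clustering = characterization: find the clouds without using any labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# 2) Classification: place each new data item into one of the known clusters.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, kmeans.labels_)
print("class of new item:", clf.predict([[0.0, 0.0]]))

# 3) Outlier detection = surprise: flag items far from every cluster center.
dist_to_nearest = kmeans.transform(X).min(axis=1)
outliers = X[dist_to_nearest > np.percentile(dist_to_nearest, 99)]
print("items flagged as surprises:", len(outliers))
```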
Scientists have been doing Data Mining for centuries:
“The data are mine, and you can’t have them!”
• Seriously …
• Scientists love to classify things … (Supervised Learning, e.g., classification)
• Scientists love to characterize things … (Unsupervised Learning, e.g., clustering)
• And we love to discover new things … (Semi-supervised Learning, e.g., outlier detection)
Data-Driven Discovery: Scientific KDD (Knowledge Discovery from Data)
1. Class Discovery
2. Novelty Discovery
3. Correlation Discovery
4. Association Discovery
Graphic from S. G. Djorgovski
• Benefits of very large datasets:
  • best statistical analysis of “typical” events
  • automated search for “rare” events
Scientific Data-to-Knowledge Problem 1-a
• The Class Discovery Problem (clustering):
  – Find distinct clusters of multivariate scientific parameters that separate objects within a data set.
  – Find new classes of objects or new behaviors.
  – What is the significance of the clusters (statistically and scientifically)?
  – What is the optimal algorithm for finding friends-of-friends or nearest neighbors in very high dimensions (complex data with Variety)?
    • N > 10^10, so what is the most efficient way to sort?
    • Number of dimensions > 1000 – therefore, we have an enormous subspace search problem. (toy sketch below)
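As a toy illustration of the nearest-neighbor question (not an answer to it at N ~ 10^10), the sketch below uses scikit-learn’s tree-based neighbor index; the data size and dimensionality are assumptions chosen to run anywhere:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((100_000, 50))            # 10^5 objects, 50 features (toy scale)

# Tree indices (kd-tree, ball tree) beat brute force in low dimensions but
# degrade as dimensionality grows: the subspace-search problem noted above.
nn = NearestNeighbors(n_neighbors=10, algorithm="ball_tree").fit(X)
distances, indices = nn.kneighbors(X[:5])  # 10 nearest neighbors of 5 objects
print(indices.shape)                       # (5, 10)
```

At survey scale, approximate methods (locality-sensitive hashing, graph-based indices) replace exact trees; the slide’s question is precisely which trade-off is optimal.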
Scientific Data-to-Knowledge Problem 1-b
• The superposition / decomposition problem:
  – Finding the parameters, or combinations of parameters (out of 100s or 1000s), that most cleanly and optimally (parsimoniously) distinguish different object classes.
  – What if there are 10^10 objects that overlap in a 10^3-D parameter space?
  – What is the optimal way to separate and accurately classify the different unique classes of objects?
Class Discovery: feature separation and discrimination of classes
Reference: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf
• The separation of classes improves when the “correct” features are chosen for investigation, as in the following star-galaxy discrimination test: the “Star-Galaxy Separation” Problem.
Figure panels: “Not good” vs. “Good” feature choices. (see the simulated example below)
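The “Not good” vs. “Good” contrast can be mimicked on simulated data: a depth-1 decision tree (a single parsimonious cut) picks out the discriminating feature automatically. Everything below is simulated for illustration, not real survey photometry:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 5000
# Feature 0: magnitude-like, same distribution for both classes ("not good").
# Feature 1: concentration-like, shifted between classes ("good").
stars    = np.column_stack([rng.normal(20, 2, n), rng.normal(1.0, 0.1, n)])
galaxies = np.column_stack([rng.normal(20, 2, n), rng.normal(1.6, 0.2, n)])
X = np.vstack([stars, galaxies])
y = np.array([0] * n + [1] * n)          # 0 = star, 1 = galaxy

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
print("feature chosen for the cut:", stump.tree_.feature[0])  # -> 1 ("good")
print("accuracy of one cut:", stump.score(X, y))              # typically ~0.98
```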
Scientific Data-to-Knowledge Problem 1234
• The Novelty Discovery Problem:
  – Anomaly Detection, Deviation Detection, Surprise Discovery, Novelty Discovery: finding objects and events that are outside the bounds of our expectations (outside known clusters).
  – Finding new, rare, one-in-a-million(billion)(trillion) objects and events – finding the Unknown Unknowns.
  – These may be real scientific discoveries or garbage.
  – Outlier detection is therefore useful for:
    • Anomaly Detection – is the detector system working?
    • Data Quality Assurance – is the data pipeline working?
    • Novelty Discovery – is my Nobel prize waiting?
  – How does one optimally find outliers in a 10^3-D parameter space, or in interesting subspaces (in lower dimensions)? (sketch below)
  – How do we measure their “interestingness”?
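One standard (by no means unique) approach to the high-dimensional outlier question is an ensemble method such as Isolation Forest; this is a hedged sketch with injected rarities, not any survey’s actual pipeline:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
typical = rng.normal(0, 1, size=(100_000, 20))   # "typical" objects, 20-D
rare    = rng.normal(6, 1, size=(10, 20))        # one-in-ten-thousand events
X = np.vstack([typical, rare])

iso = IsolationForest(random_state=1).fit(X)
scores = iso.score_samples(X)                    # lower = more anomalous
top = np.argsort(scores)[:10]                    # most "interesting" first
print(top)   # indices >= 100000 are the injected rarities
# Whether a flagged object is a discovery or garbage still needs follow-up:
# detector health checks, pipeline QA, and only then the Nobel prize.
```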
Novelty Discovery: Improved Discovery of Rare Objects or Events across Multiple Data Sources
Scientific Data-to-Knowledge Problem 1234
• The Correlation Discovery Problem = the Dimension Reduction Problem:
  – Finding new correlations and “fundamental planes” of parameters.
  – Such correlations, patterns, and dependencies may reveal new physics or new scientific relations.
  – The number of attributes can be hundreds or thousands = the Curse of High Dimensionality!
  – Are there eigenvectors or condensed representations (e.g., basis sets) that represent the full set of properties? (see the PCA sketch below)
Figure: Fundamental Plane for 156,000 Elliptical Galaxies – the plot shows the % of variance captured by the first 2 Principal Components (PC1+PC2) as a function of local galaxy density (low to high).
Reference: Borne, Dutta, Giannella, Kargupta, & Griffin 2008
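A minimal PCA sketch of the same idea, on toy data built to have two latent degrees of freedom hidden among 50 observed parameters (the real 156,000-galaxy analysis is in the cited paper):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
latent = rng.normal(size=(10_000, 2))            # two true degrees of freedom
mixing = rng.normal(size=(2, 50))                # spread into 50 attributes
X = latent @ mixing + 0.05 * rng.normal(size=(10_000, 50))  # plus noise

pca = PCA(n_components=10).fit(X)
pc12 = pca.explained_variance_ratio_[:2].sum()
print(f"variance captured by PC1+PC2: {100 * pc12:.1f}%")   # ~99%
```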
Scientific Data-to-Knowledge Problem 1234
• The Association Discovery Problem: Link Analysis – Network Analysis – Graph Mining
  – Identify connections between different events (or objects).
  – Find unusual (improbable) co-occurring combinations of data attribute values. (toy example below)
  – Find data items that have much fewer than “6 degrees of separation”.
  – Identifying such connectivity in our scientific databases and knowledge repositories can lead to new insights, new knowledge, new discoveries.
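A toy illustration of “improbable co-occurrence” using the lift statistic (made-up attribute tags of my own; production work would use association-rule or graph-mining libraries):

```python
from itertools import combinations
from collections import Counter

# Each object carries a set of attribute tags (all names here are invented).
objects = [
    {"variable", "blue"}, {"variable", "blue"}, {"variable", "blue"},
    {"red"}, {"red"}, {"radio-loud"}, {"radio-loud"},
]
n = len(objects)
singles = Counter(tag for obj in objects for tag in obj)
pairs = Counter(frozenset(p) for obj in objects
                for p in combinations(sorted(obj), 2))

for pair, count in pairs.items():
    a, b = sorted(pair)
    lift = (count / n) / ((singles[a] / n) * (singles[b] / n))
    print(f"{a} & {b}: lift = {lift:.2f}")  # lift >> 1: co-occur beyond chance
```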
There are many technologies associated with Big Data
http://siliconangle.com/blog/2012/07/13/big-data-nightmares/
One approach to Big Data: Computational Science (Hadoop, MapReduce) (pattern sketched below)
http://www.bigdatabytes.com/wp-content/uploads/2012/01/big-data.jpg
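For readers new to the pattern, here is the map/shuffle/reduce idea in a few lines of plain Python (a single-process imitation of my own; a real Hadoop job distributes exactly these three steps across a cluster):

```python
from collections import defaultdict

# Toy input: (night, event_type) records from a transient-alert stream.
records = [("n1", "SN"), ("n1", "SN"), ("n1", "varstar"), ("n2", "SN")]

# Map: emit (key, 1) for every record.
mapped = [((night, kind), 1) for night, kind in records]

# Shuffle: group all values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate each group independently (hence trivially parallel).
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {('n1', 'SN'): 2, ('n1', 'varstar'): 1, ('n2', 'SN'): 1}
```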
Another approach to Big Data: Data Science (Informatics)
A third approach to Big Data: Citizen Science (crowdsourcing)
Galaxy Zoo: example of Citizen Science (crowdsourcing)
http://astrophysics.gsfc.nasa.gov/outreach/podcast/wordpress/index.php/2010/10/08/saras-blog-be-a-scientist/
Astroinformatics research paper available! It addresses the data science challenges, research agenda, application areas, use cases, and recommendations for the new science of Astroinformatics.
Borne (2010): “Astroinformatics: Data-Oriented Astronomy Research and Education”, Journal of Earth Science Informatics, vol. 3, pp. 5-17. See also http://arxiv.org/abs/0909.3892
LSST = Large Synoptic Survey Telescope
8.4-meter diameter primary mirror, with a ~10 square degree field of view!
http://www.lsst.org/
Hello!
– 100-200 Petabyte image archive
– 20-40 Petabyte database catalog
Observing Strategy: one pair of images every 40 seconds for each spot on the sky, then continue across the sky continuously every night for 10 years (~2021-2031), with time-domain sampling in log(time) intervals (to capture the dynamic range of transients).
• LSST (Large Synoptic Survey Telescope):
  – Ten-year time series imaging of the night sky – mapping the Universe!
  – ~2,000,000 events each night – anything that goes bump in the night!
  – Cosmic Cinematography! The New Sky! @ http://www.lsst.org/
The LSST Informatics and Statistics Science Collaboration (ISSC) Research Team
• The ISSC team focuses on several research areas:
  – Statistics
  – Data & Information Visualization
  – Data mining (machine learning)
  – Data-intensive computing & analysis
  – Large-scale scientific data management
  (the latter four areas comprise Informatics)
• These areas represent Statistics and the science of Informatics (Astroinformatics) = Data-intensive Science = the 4th Paradigm of Scientific Research:
  – Addressing the LSST “Data to Knowledge” challenge
  – Helping to discover the unknown unknowns
The LSST ISSC Research Team
• Chairperson: K. Borne, GMU
• Core team: 3 astronomers + 2 others:
  – K. Borne (astroinformatics)
  – Eric Feigelson, Tom Loredo (astrostatistics)
  – Jogesh Babu (statistics)
  – Alex Gray (computer science, data mining)
• Full team: 41 scientists
  – ~60% astronomers
  – ~30% statisticians
  – ~10% data mining / machine learning computer scientists
http://www.lsstcorp.org/ScienceCollaborators/ScienceMembers.php
http://aurora.gmu.edu/~kborne/LSST-Informatics-and-Statistics.pdf
Some key astronomy problems that require informatics and statistical techniques … Astroinformatics & Astrostatistics!
• Probabilistic Cross-Matching of objects from different catalogues
• The distance problem (e.g., Photometric Redshift estimators) (see the photo-z sketch below)
• Star-Galaxy separation; QSO-Star separation
• Cosmic-Ray Detection in images
• Supernova Detection and Classification
• Morphological Classification (galaxies, AGN, gravitational lenses, …)
• Class and Subclass Discovery (brown dwarfs, methane dwarfs, …)
• Dimension Reduction = Correlation Discovery
• Learning Rules for improved classifiers
• Classification of massive data streams
• Real-time Classification of Astronomical Events
• Clustering of massive data collections
• Novelty, Anomaly, Outlier Detection in massive databases
http://aurora.gmu.edu/~kborne/LSST-Informatics-and-Statistics.pdf
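As flagged in the list above, here is a sketch of the distance (photo-z) problem treated as regression over colors. The simulated data, the linear color-redshift relation, and the model choice are all illustrative assumptions, not a real photo-z estimator:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 20_000
z = rng.uniform(0, 3, n)                 # "spectroscopic" training redshifts
# Four toy colors (u-g, g-r, r-i, i-z stand-ins), each loosely tracking z.
colors = np.column_stack(
    [c * z + rng.normal(0, 0.1, n) for c in (0.8, 0.5, 0.3, 0.2)])

X_tr, X_te, z_tr, z_te = train_test_split(colors, z, random_state=3)
model = RandomForestRegressor(n_estimators=100, random_state=3).fit(X_tr, z_tr)
z_phot = model.predict(X_te)
print("photo-z scatter:", round(float(np.std(z_phot - z_te)), 3))
```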
ISSC “current topics”
• Advancing the field = Community-building:
  – Astroinformatics + Astrostatistics (several workshops this year!)
  – Education, education, education! (Citizen Science, undergrad + grad education, …)
• LSST Event Characterization vs. Classification
• Sparse time series and the LSST observing cadence (see the period-finding sketch below)
• Challenge Problems, such as the Photo-z challenge and the Supernova Photometric Classification challenge
• Testing algorithms on the LSST simulations: images/catalogs PLUS observing cadence – can we recover known classes of variability?
• Generating and/or accumulating training samples of numerous classes (especially variables and transients)
• Proposing a mini-survey during the science verification year (Science Commissioning):
  – e.g., high-density and evenly-spaced observations of extragalactic and Galactic test fields, to generate training sets for variability classification and assessment thereof
• Science Data Quality Assessment (SDQA): R&D efforts to support the LSST Data Management team
http://aurora.gmu.edu/~kborne/LSST-Informatics-and-Statistics.pdf
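For the sparse-time-series topic above, a small period-finding sketch using astropy’s Lomb-Scargle periodogram on a toy, unevenly sampled light curve (the period, amplitude, and noise level are assumed for illustration):

```python
import numpy as np
from astropy.timeseries import LombScargle

rng = np.random.default_rng(4)
t = np.sort(rng.uniform(0, 365, 120))    # 120 visits spread over one year
true_period = 2.7                        # days (assumed toy variable star)
y = 1.0 + 0.3 * np.sin(2 * np.pi * t / true_period) \
        + rng.normal(0, 0.05, t.size)

frequency, power = LombScargle(t, y).autopower()
best_period = 1.0 / frequency[np.argmax(power)]
print(f"recovered period: {best_period:.2f} d")  # ~2.70 if the cadence allows
```

Whether a given cadence lets this succeed for each variability class is exactly the “can we recover known classes of variability?” question above.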
Agenda of ISSC Workshop at LSST Project Meeting, August 2012
Brief talks by team members:
– Kirk Borne: Outlier Detection for Surprise Discovery in Big Data
– Jogesh Babu: Statistical Resources
– Nathan De Lee: The VIDA Astroinformatics Portal
– Matthew Graham: Characterizing and Classifying CRTS
– Joseph Richards: Time-Domain Discovery and Classification
– Sam Schmidt: Upcoming Challenges for Photometric Redshifts
– Lior Shamir: Automatic Analysis of Galaxy Morphology
– John Wallin: Citizen Science and Machine Learning
– Jake Vanderplas: AstroML – Machine Learning for Astronomy
http://aurora.gmu.edu/~kborne/LSST-Informatics-and-Statistics.pdf
Why do all of this? … for 4 very simple reasons:
• (1) Any real data collection may consist of millions, or billions, or trillions of sampled data points. = Volume: it is too much!
• (2) Any real data set will probably have many hundreds (or thousands) of measured attributes (features, dimensions). = Variety: it is too complex!
• (3) Humans can make mistakes when staring for hours at long lists of numbers, especially in a dynamic data stream. = Velocity: it keeps on coming!
• (4) The use of a data-driven model provides an objective, scientific, rational, and justifiable test of a hypothesis. = Veracity: can you prove your results?
Knowledge Discovery from Mining Big Data – 12
• By collecting a thorough set of parameters (high-dimensional data) for a complete set of items within our domain of study, we should be able to build a “perfect” statistical model for that domain.
• In other words, Big Data becomes the model for a domain X = we call this X-informatics.
• Anything we want to know about that domain is specified and encoded within the data.
• The goal of Big Data Science is to find those encodings, patterns, and knowledge nuggets.
• Example: the Big-Data Vision? Whole-population analytics.
• Recall what we said before …
Knowledge Discovery from Mining Big Data – 12
• … one of the two major benefits of BIG DATA is to provide the best statistical analysis ever(!) for the domain of study.
• Benefits of very large datasets:
  1. best statistical analysis of “typical” events
  2. automated search for “rare” events