- GeoInfo 2014

Notes on Geographic Information
Systems, DBMS Technology,
and SciDB
or
A Tale of Dirt-Bags (Earth Scientists) and
Propeller-Heads (Computer Scientists)
Dr. Paul G. Brown
Paradigm4 / SciDB
Overview of Talk
• How did we get here?
• DBMS, GIS, Scientific and Statistical Data Management
• Pioneers and Pilgrims
• Dark Ages: Pre-Internet, Pre-Web, Sneaker-net and boxes full of
CDs
• Sequoia 2000
• Jim Gray and the Sloan Digital Sky Survey
• XLDB Conferences
• Science and Its Methods
• Why your skill sets will become lucrative (not just important).
• Quick Overview of SciDB
We are witnessing the rise of the Scientific Data Management System: a category of
applications that draws on the lessons of traditional IT, but focuses on the
requirements and methods of scientific data management and analysis.
In the (very) beginning …
• In the (very) beginning was the application …
– Small set of files with (semi-)standardized internal format.
– Large and complex libraries for accessing file content.
– Simple(?) scripting languages for glue.
– Examples: IMS + COBOL + JCL, NetCDF/HDF/FITS + C/C++ + Perl/Python, etc.
• Commercial Data Management: Rapid adoption of RDBMS / SQL. Why?
– Ad hoc (for each task) data model requirement (no industry standards).
– More demanding quality of service guarantees (transactions, access control).
– Enormous pent-up demand for data sharing and collaboration.
– Commercial data management was process oriented.
• Scientific Data Management: Went a different route. Why?
– Data consumers and producers in different communities. (Sneaker-net).
– Science organized into project teams: goal oriented.
– Technical innovation (algorithm development) as important as scientific progress.
Sequoia 2000
• 5 Year Investigation into Scientific Data Management
• 1991-1996 – University of California System, DARPA, Digital Equipment Corp.
• Collaboration between Computer Science types (Mike Stonebraker at UC
Berkeley, Jim Gray at DEC) and users of Geographic Data (Dozier/Frew at UC
Santa Barbara, UCLA and UC San Diego Climate Modelers)
• EOS-DIS Alternative Architecture Study
• First wide area network (connecting UC Campuses) at “T3” bandwidth
(100MBps)
• Postgres 4.3 – R-Trees, spatial types etc. Eventually, PostgreSQL and PostGIS
• The Propeller-heads and the Dirt-bags
• “Ignorance raised to the power of arrogance.” – James Frew
• Computer Scientists – “What do you mean your data’s square?”
• Earth Scientists – “What do you mean more than one person can read and
write the same data at the same time?”
(Hard-won) Lessons
• Collaboration isn’t easy …
• Different teams spoke different languages …
• … even within related scientific disciplines.
• Dirt-bags had more to gain than Propeller-heads
• Technology that enables collaboration.
• Ask questions (queries), don’t write programs.
• Propeller-heads had more to learn than Dirt-bags
• SQL might be a $10 billion market, but it doesn’t do:
• Image processing, numerical analysis, time-series, HDF, etc.
Strong Claim # 1 : Inter-disciplinary innovation is
necessary for us to make significant progress.
XLDB Conferences – 2008
• Bring together Propeller-heads and Dirt-bags
• 2008 Thought: How to do next-gen Science?
– Large Hadron Collider
– Large Synoptic Survey Telescope
– Initial survey of science requirements that informed
the design and implementation of SciDB.
• 2009 Thought: What about Industrial Big Data
Users?
– Turned out, industrial data sizes will be 10x scientific!
– “Internet of things”
• Now 3 Annual Conferences – US, Europe, Asia
Big Science – Research Systems
How can we use DBMS technology to help Scientific data management?
• Sloan Digital Sky Survey (http://www.sdss.org/)
– A database of astronomical objects.
– Query-centric interfaces, web-facing APIs.
• TerraServer – (http://www.terraserver.com/)
– Point the “big eye” down
– Commercial application of remote sensing data.
• NIH – 1,000 Genomes Online
– Powered by SciDB since 2010
– 8T of data online, 3,000 analytic sessions per day.
– Growing as fast as they can …
Where are the Propeller-Heads?
Playing Football like Seven Year Olds
• Documents! Hadoop! Triple-Stores! Graph-DBs!
• Take a technology with proven value in a specific use-case …
• … declare it to be the Next Big Thing (it will crush SQL/RDBMS!) …
• … chase each new idea like seven-year-olds chase a ball.
• Roll the Clock Forward to 2014
– Hadoop Providers are (Re-)Implementing SQL:
HIVE, Cloudera’s Impala, Hortonworks Stinger, YARN + Spark
Strong Claim # 2: One size (one technical architecture) does NOT fit
all problem domains.
Jim Gray and the Fourth Paradigm
• Who was Jim?
• Turing Award Winner - 1998
• Architect of $erious $ystems
(Ultimate Propeller-head)
• What is the “Fourth Paradigm”?
• eScience
“Increasingly, scientific breakthroughs will be powered by
advanced computing capabilities that help researchers
manipulate and explore massive datasets.”
The “Big Idea” Slide
The methodologies used to analyze scientific data
are central to how we understand our world.
• Ubiquitous networks of sensors will render much of the world
an empirical or scientific phenomenon.
– How to store all that data?
– How do we share that data?
– How will we reason about it?
– How can such development be made to work economically?
“Increasingly, [scientific breakthroughs → life’s everyday decisions]
will be powered by advanced computing capabilities that help
[researchers → everyone] manipulate and explore massive datasets.”
Challenges
• Collaboration
• Overcoming the “language challenges” inherent when
attempting any inter-disciplinary project.
• Sensitivity to legal and ethical issues: privacy.
• Information Integration
• Technical standards for data communication.
• Data cleanliness, and identifying common information.
• Visualization and Simple (but not too simple!) Interfaces
• Nothing to add!
• Ubiquitous Availability
“We have to do better at producing tools to support the whole
research cycle—from data capture and data curation to data
analysis and data visualization.” (Gray’s Turing Lecture)
Ergo, SciDB
• How do you find out what Dirt-bags want? You ask them!
– Arrays (or Matrices) as the basic structural building block
– Algebra of array manipulation operations as API
– Distributed computation (cloud or cluster) for scale
– Integrated processing and storage platform
– Extensible framework (to allow for algorithm innovation)
– Provenance (track data through its life-cycle) and no-overwrite storage
– Client languages of choice: ‘R’, Python, not 4GLs or C/C++
– In-situ data access (as well as providing a data store)
M. Stonebraker, J. Becla, D.J. DeWitt, K. Lim, D. Maier, O. Ratzesberger, and S.B. Zdonik, "Requirements
for Science Data Bases and SciDB", in Proc. CIDR, 2009
or
http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_26.pdf
if you are reading this in the 21st Century.
Why SciDB?
Big analytics without big hassles
• R, Python, Matlab, Julia, …
• MPP storage and compute
• Array data model
• Complex analytics
• Commodity clusters or cloud
SciDB – The (Very) Short Tour – 1.
CREATE ARRAY Example
< data : float >
[ X=0:*,1000,0, J=0:*,1000,0 ];
• Array Data Model
– “Data Management for Squares”
CREATE ARRAY geodata
<
  track-index  : int16,
  scanindex    : int16,
  height       : int16,
  sensorzenith : float, sensorazimuth : float,
  range        : uint32, solarzenith : float, solarazimuth : float,
  landseamask  : uint8
>
[
  longitude    = -1800000 : 1800000, 50000, 0,
  latitude     = -900000 : 900000, 50000, 0,
  start_time   = 199900000000 : 201400000000, 1, 0,
  platformid   = 0 : 1, 1, 0,
  resolutionid = 0 : 2, 1, 0
];
Example Array from: Planthaber, Gary Lee, Jr. “MODBASE : a SciDB-powered system for large-scale distributed storage and analysis of MODIS
earth remote sensing data” MIT 2012
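Each dimension in the CREATE ARRAY examples above reads as "start : end, chunk length, chunk overlap". As a rough illustration of what the chunk length implies, the Python sketch below maps a cell coordinate to the origin of the chunk that would hold it. The helper name chunk_origin is hypothetical and is not part of SciDB; chunk overlap is assumed to be 0, as in the examples.

```python
# Sketch only: illustrates the "start:end, chunk_length, overlap" dimension
# spec from the CREATE ARRAY examples. chunk_origin is a hypothetical helper,
# not a SciDB function, and it assumes a chunk overlap of 0.

def chunk_origin(coords, starts, chunk_lengths):
    """Return the lowest coordinate of the chunk containing `coords`."""
    return tuple(
        start + ((c - start) // length) * length
        for c, start, length in zip(coords, starts, chunk_lengths)
    )

# The "Example" array: two dimensions, each starting at 0, chunk length 1000.
STARTS = (0, 0)
CHUNK_LENGTHS = (1000, 1000)

origin = chunk_origin((1234, 999), STARTS, CHUNK_LENGTHS)
print(origin)  # (1000, 0)
```

Cells that share a chunk origin are stored together, so a query over a small box touches only the few chunks that intersect it.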
SciDB – The (Very) Short Tour – 2.
• Query Languages
– High Level API – What not How
SELECT SUM ( data ) AS Sum_Data
FROM between ( Example, 500, 500, 1500, 1500 );

SELECT MEDIAN ( height ) AS Median_Height,
       AVG ( height )    AS Avg_Height
FROM slice ( geodata, platformid, 3 )
WHERE sensorzenith < 35.0
REGRID AS ( PARTITION BY longitude 1000,
                         latitude 1000,
                         start_time geo_range ( '10 days' ) );
AQL looks a bit like SQL, but the underlying algebra is arrays, not sets.
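To make "arrays, not sets" concrete, here is a small Python sketch of what the two queries above ask for: a sum over a coordinate box, and an aggregate over coarser grid tiles. The functions between_cells and regrid_avg are hypothetical stand-ins written for this note, not SciDB code, and the box bounds are assumed to be inclusive.

```python
# Sketch only: a tiny sparse 2-D array stored as {(x, y): value}.
# between_cells and regrid_avg are hypothetical stand-ins that mimic the
# idea of the operators named above; they are not SciDB code.

def between_cells(array, x_lo, y_lo, x_hi, y_hi):
    """Keep cells whose coordinates fall inside the box (bounds inclusive)."""
    return {
        (x, y): v
        for (x, y), v in array.items()
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi
    }

def regrid_avg(array, x_block, y_block):
    """Average the cells that fall in each x_block-by-y_block tile."""
    sums, counts = {}, {}
    for (x, y), v in array.items():
        key = (x // x_block, y // y_block)
        sums[key] = sums.get(key, 0.0) + v
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}

# Made-up data for illustration.
example = {(0, 0): 1.0, (500, 500): 2.0, (1500, 1500): 4.0, (1501, 10): 8.0}

sum_data = sum(between_cells(example, 500, 500, 1500, 1500).values())
coarse = regrid_avg(example, 1000, 1000)
print(sum_data)  # 6.0
print(coarse)
```

Both operations are defined by cell coordinates rather than by values in a column, which is the sense in which the underlying algebra is over arrays.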
SciDB – The (Very) Short Tour – 3.
• AFL
– Functional, array level manipulation
– Familiar to ‘R’ and Python Users

AQL:
SELECT MEDIAN ( height ) AS Median_Height,
       AVG ( height )    AS Avg_Height
FROM slice ( geodata, platformid, 3 )
WHERE sensorzenith < 35.0
REGRID AS ( PARTITION BY longitude 1000,
                         latitude 1000,
                         start_time geo_range ( '10 days' )
);

AFL:
project (
  apply (
    join (
      filter (
        Masks, name = 'California'
      ),
      geodata
    ),
    height_color,
    calc_height_color(height)
  )
);
Composable Query Languages allow you to build sophisticated
programs by combining simple building blocks.
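The composition idea can be sketched in Python, one of the client languages named earlier. Each helper below takes an array and returns an array, so calls nest the same way the AFL example nests. All helpers, the cell layout, and the colour rule are hypothetical illustrations, not SciDB's API.

```python
# Sketch only: each "array" is a dict {coords: {attribute: value}}.
# These helpers are hypothetical Python analogues of the operator names in
# the AFL example, written to show composition; they are not SciDB's API.

def filter_(array, predicate):
    """Keep only the cells for which predicate(cell) is true."""
    return {c: cell for c, cell in array.items() if predicate(cell)}

def join(left, right):
    """Combine attributes of cells present at the same coordinates."""
    return {c: {**left[c], **right[c]} for c in left if c in right}

def apply(array, name, fn):
    """Add a computed attribute to every cell."""
    return {c: {**cell, name: fn(cell)} for c, cell in array.items()}

def project(array, *names):
    """Keep only the named attributes of every cell."""
    return {c: {n: cell[n] for n in names} for c, cell in array.items()}

# Made-up data for illustration.
masks = {(0, 0): {"name": "California"}, (0, 1): {"name": "Nevada"}}
geodata = {(0, 0): {"height": 120}, (0, 1): {"height": 1300}}

def calc_height_color(cell):
    # Made-up colour rule, purely for illustration.
    return "green" if cell["height"] < 1000 else "brown"

result = project(
    apply(
        join(filter_(masks, lambda cell: cell["name"] == "California"), geodata),
        "height_color",
        calc_height_color,
    ),
    "height_color",
)
print(result)  # {(0, 0): {'height_color': 'green'}}
```

Because every operator has the same input and output type, any result can feed any other operator, which is what lets simple blocks build up into larger programs.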
SciDB Architecture
[Figure: SciDB architecture diagram. Labelled components: SciDB Client (iquery, ‘R’, Java, Python); SciDB Coordinator Node(s); PostgreSQL as the Persistent System Catalog Service, reached over a PostgreSQL Connection; SciDB Worker Nodes; SciDB Nodes, each drawn with a SciDB Engine and a Local Store; SciDB InterNode Communication.]
Scientific “Big Data”, “Big Analytics”
• Dark Matter Detector – LUX
– 1 TB per day
– 100 collaborators (research grants)
– Find “interesting” particle collisions in a barrel holding 370 kg of liquid
xenon, where interesting events are very rare.
• Metabolic Atlas – Mass Spectrometry DB
– Genomics + Phenotype + Proteomics
– “What is alive in this drop of sea-water?”
• Next Generation Genome Sequencing
– Cost of sequencing a human genome is collapsing.
2000 - $1M, 2010 - $10K, 2015 - $1K
– Data per sequencing process is growing.
Ion Torrent Sequencing – 80 B reads of 400 bp / read @ $1 per M bp in 2 hours
– Gene sequencing will be a routine part of medicine by the end of the decade.
Surprises Along the Way
• There is commercial demand for SciDB!
• Image processing applications in Radiology, Bio-IT.
• Remote sensing applications interesting to various Govt.
agencies and some commercial entities (agriculture, logistics).
• Geo-located sensors in vehicles; driver behavior for insurance.
• Arrays for more than just images
• Genome database: 2D array [ sample x base_pair ]
• Timeseries data: [ anything x time ]
• Graph Analytics: [ calling_phone_# x called_phone_# ]
• Traction on the “Scientific Warehouse”
• Cost savings by centralizing infrastructure
• Productivity advantages from cross-team collaboration
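The "arrays for more than just images" point can be illustrated with the graph case: call records laid out as a sparse 2D array indexed by [ calling_phone_# x called_phone_# ]. The Python sketch below uses made-up phone numbers and hypothetical helper names; it only shows that common graph measures reduce to aggregates along one array dimension.

```python
from collections import Counter

# Sketch only: call records as a sparse 2-D array indexed by
# [calling_phone_# x called_phone_#]; the phone numbers are made up.
calls = Counter()
for caller, callee in [(101, 202), (101, 303), (202, 101), (101, 202)]:
    calls[(caller, callee)] += 1  # cell value = number of calls

def sum_along_called(array):
    """Collapse the called dimension: total calls placed by each caller."""
    totals = Counter()
    for (caller, _callee), n in array.items():
        totals[caller] += n
    return dict(totals)

def distinct_contacts(array):
    """Out-degree: number of non-empty cells in each caller's row."""
    degree = Counter()
    for (caller, _callee) in array:
        degree[caller] += 1
    return dict(degree)

print(sum_along_called(calls))   # {101: 3, 202: 1}
print(distinct_contacts(calls))  # {101: 2, 202: 1}
```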
Strong Claim # 3: Tools and
methodologies that have traditionally
been restricted to “scientific”
research will become central to
“commercial” and “industrial” data
processing.
Conclusions
• Dirt-bags told the Propeller-heads what they wanted …
• Scalable, flexible array storage and data processing.
• Platform for collaborative analytics on machine-data.
• Propeller-heads responded …
• Not just SciDB!
• MonetDB, Rasdaman, InfluxDB – all array DBMSs
• Struggling to fit scientific data processing into other paradigms –
SQL + HDFS.
• And not a moment too soon!
• Shift of management approach in Big Science towards
shared infrastructure (cost saving, productivity).
• Multiple “commercial” consumers who need to use
scientific tools and methods in their analysis.