Notes on Geographic Information Systems, DBMS Technology, and SciDB
or
A Tale of Dirt-Bags (Earth Scientists) and Propeller-Heads (Computer Scientists)

Dr. Paul G. Brown
Paradigm4 / SciDB

Overview of Talk
• How did we get here?
• DBMS, GIS, Scientific and Statistical Data Management
• Pioneers and Pilgrims
• Dark Ages: Pre-Internet, Pre-Web, Sneaker-net and boxes full of CDs
• Sequoia 2000
• Jim Gray and the Sloan Digital Sky Survey
• XLDB Conferences
• Science and Its Methods
• Why your skill sets will become lucrative (not just important).
• Quick Overview of SciDB

We are witnessing the rise of the Scientific Data Management System: a category of applications that draws on the lessons of traditional IT but focuses on the requirements and methods of scientific data management and analysis.

In the (very) beginning …
• In the (very) beginning was the application …
  • Small set of files with (semi-)standardized internal format.
  • Large and complex libraries for accessing file content.
  • Simple(?) scripting languages for glue.
  • Examples: IMS + COBOL + JCL, NetCDF/HDF/FITS + C/C++ + Perl/Python etc.
• Commercial Data Management: Rapid adoption of RDBMS / SQL. Why?
  • Ad hoc (for each task) data model requirement (no industry standards).
  • More demanding quality of service guarantees (transactions, access control).
  • Enormous pent-up demand for data sharing and collaboration.
  • Commercial data management was process oriented.
• Scientific Data Management: Went a different route. Why?
  • Data consumers and producers in different communities. (Sneaker-net).
  • Science organized into project teams: goal oriented.
  • Technical innovation (algorithm development) as important as scientific progress.

Sequoia 2000
• 5 Year Investigation into Scientific Data Management
• 1991-1996 – University of California System, DARPA, Digital Equipment Corp.
• Collaboration between Computer Science types (Mike Stonebraker at UC Berkeley, Jim Gray at DEC) and users of Geographic Data (Dozier/Frew at UC Santa Barbara, UCLA and UC San Diego Climate Modelers)
• EOS-DIS Alternative Architecture Study
• First wide area network (connecting UC Campuses) at “T3” bandwidth (about 45 Mbps)
• Postgres 4.3 – R-Trees, spatial types etc. Eventually, PostgreSQL and PostGIS

The Propeller-heads and the Dirt-bags
• “Ignorance raised to the power of arrogance.” – James Frew
• Computer Scientists – “What do you mean your data’s square?”
• Earth Scientists – “What do you mean more than one person can read and write the same data at the same time?”

(Hard-won) Lessons
• Collaboration isn’t easy …
  • Different teams spoke different languages …
  • … even within related scientific disciplines.
• Dirt-bags had more to gain than Propeller-heads
  • Technology that enables collaboration.
  • Ask questions (queries), don’t write programs.
• Propeller-heads had more to learn than Dirt-bags
  • SQL might be a $10 billion market, but it doesn’t do:
  • Image processing, numerical analysis, time-series, HDF, etc.

Strong Claim # 1: Inter-disciplinary innovation is necessary for us to make significant progress.

XLDB Conferences – 2008
• Bring together Propeller-heads and Dirt-bags
• 2008 Thought: How to do next-gen Science?
  – Large Hadron Collider
  – Large Synoptic Survey Telescope
  – Initial survey of science requirements that informed the design and implementation of SciDB.
• 2009 Thought: What about Industrial Big Data Users?
  – Turned out, industrial data sizes will be 10x scientific!
  – “Internet of things”
• Now 3 Annual Conferences – US, Europe, Asia

Big Science – Research Systems
How can we use DBMS technology to help scientific data management?
• Sloan Digital Sky Survey (http://www.sdss.org/)
  – A database of astronomical objects.
  – Query-centric interfaces, web-facing APIs.
• TerraServer (http://www.terraserver.com/)
  – Point the “big eye” down
  – Commercial application of remote sensing data.
• NIH – 1,000 Genomes Online
  – Powered by SciDB since 2010
  – 8 TB of data online, 3,000 analytic sessions per day.
  – Growing as fast as they can …

Where are the Propeller-Heads? Playing Football like Seven Year Olds
• Documents! Hadoop! Triple-Stores! Graph-DBs!
  • Take a technology with proven value in a specific use-case …
  • … declare it to be the Next Big Thing (it will crush SQL/RDBMS!) …
  • … chase each new idea like seven-year-olds chase a ball.
• Roll the Clock Forward to 2014
  • Hadoop providers are (re-)implementing SQL: Hive, Cloudera’s Impala, Hortonworks Stinger, YARN + Spark

Strong Claim # 2: One size (one technical architecture) does NOT fit all problem domains.

Jim Gray and the Fourth Paradigm
• Who was Jim?
  • Turing Award Winner – 1998
  • Architect of $erious $ystems (Ultimate Propeller-head)
• What is the “Fourth Paradigm”?
  • eScience
“Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets.”

The “Big Idea” Slide
The methodologies used to analyze scientific data are central to how we understand our world.
• Ubiquitous networks of sensors will render much of the world an empirical or scientific phenomenon.
  • How do we store all that data?
  • How do we share that data?
  • How will we reason about it?
  • How can such development be made to work economically?
“Increasingly, life’s everyday decisions [originally: scientific breakthroughs] will be powered by advanced computing capabilities that help everyone [originally: researchers] manipulate and explore massive datasets.”

Challenges
• Collaboration
  • Overcoming the “language challenges” inherent when attempting any inter-disciplinary project.
  • Sensitivity to legal and ethical issues: privacy.
• Information Integration
  • Technical standards for data communication.
  • Data cleanliness, and identifying common information.
• Visualization and Simple (but not too simple!) Interfaces
  • Nothing to add!
• Ubiquitous Availability
“We have to do better at producing tools to support the whole research cycle – from data capture and data curation to data analysis and data visualization.” (Gray’s Turing Lecture)

Ergo, SciDB
• How do you find out what Dirt-bags want? You ask them!
  • Arrays (or Matrices) as the basic structural building block
  • Algebra of array manipulation operations as API
  • Distributed computation (cloud or cluster) for scale
  • Integrated processing and storage platform
  • Extensible framework (to allow for algorithm innovation)
  • Provenance (track data through its life-cycle) and no-overwrite storage
  • Client languages of choice: ‘R’, Python, not 4GLs or C/C++
  • In-situ data access (as well as providing a data store)
M. Stonebraker, J. Becla, D.J. DeWitt, K. Lim, D. Maier, O. Ratzesberger, and S.B. Zdonik, "Requirements for Science Data Bases and SciDB", in Proc. CIDR, 2009, or http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_26.pdf if you are reading this in the 21st Century.

Why SciDB?
Big analytics without big hassles
• R, Python, Matlab, Julia, …
• MPP Storage and Compute
• Array data model
• Complex analytics
• Commodity clusters or cloud

SciDB – The (Very) Short Tour – 1.
• Array Data Model – “Data Management for Squares”

    CREATE ARRAY Example
    < data : float >
    [ X=0:*,1000,0, J=0:*,1000,0 ];

    CREATE ARRAY geodata
    < track_index : int16,
      scanindex : int16,
      height : int16,
      sensorzenith : float,
      sensorazimuth : float,
      range : uint32,
      solarzenith : float,
      solarazimuth : float,
      landseamask : uint8 >
    [ longitude = -1800000 : 1800000, 50000, 0,
      latitude = -900000 : 900000, 50000, 0,
      start_time = 199900000000 : 201400000000, 1, 0,
      platformid = 0 : 1, 1, 0,
      resolutionid = 0 : 2, 1, 0 ];

Example Array from: Planthaber, Gary Lee, Jr. “MODBASE: a SciDB-powered system for large-scale distributed storage and analysis of MODIS earth remote sensing data”, MIT, 2012.

SciDB – The (Very) Short Tour – 2.
• Query Languages
  – High Level API
  – What not How

    SELECT SUM ( data ) AS Sum_Data
    FROM between ( Example, 500, 500, 1500, 1500 );

    SELECT MEDIAN ( height ) AS Median_Height,
           AVG ( height ) AS Avg_Height
    FROM slice ( geodata, platformid, 3 )
    WHERE sensorzenith < 35.0
    REGRID AS ( PARTITION BY longitude 1000, latitude 1000,
                start_time geo_range ( '10 days' ) );

AQL looks a bit like SQL, but the underlying algebra is arrays, not sets.

SciDB – The (Very) Short Tour – 3.
• AFL
  – Functional, array level manipulation
  – Familiar to ‘R’ and Python Users

    project (
      apply (
        join (
          filter ( Masks, name = 'California' ),
          geodata
        ),
        height_color, calc_height_color(height)
      )
    );

    SELECT MEDIAN ( height ) AS Median_Height,
           AVG ( height ) AS Avg_Height
    FROM slice ( geodata, platformid, 3 )
    WHERE sensorzenith < 35.0
    REGRID AS ( PARTITION BY longitude 1000, latitude 1000,
                start_time geo_range ( '10 days' ) );

Composable query languages allow you to build sophisticated programs by combining simple building blocks.

SciDB Architecture
[Architecture diagram. Labeled components: SciDB Client (iquery, ‘R’, Java, Python); SciDB Coordinator Node(s) with a SciDB Engine and Local Store; PostgreSQL as the Persistent System Catalog Service; PostgreSQL Connection; SciDB Inter-Node Communication; SciDB Worker Nodes, each a SciDB Node with a SciDB Engine and Local Store.]

Scientific “Big Data”, “Big Analytics”
• Dark Matter Detector – LUX
  – 1 TB per day
  – 100 collaborators (research grants)
  – Find “interesting” particle collisions in a barrel holding 370 kg of liquid xenon, where interesting events are very rare.
• Metabolic Atlas – Mass Spectrometry DB
  – Genomics + Phenotype + Proteomics
  – “What is alive in this drop of sea-water?”
• Next Generation Genome Sequencing
  – Cost of sequencing a human genome is collapsing. 2000 - $1M, 2010 - $10K, 2015 - $1K
  – Data per sequencing process is growing.
    Ion Torrent Sequencing – 80 B reads of 400 bp / read @ $1 per M bp in 2 hours
  – Gene sequencing will be a routine part of medicine by the end of the decade.

Surprises Along the Way
• There is commercial demand for SciDB!
  • Image processing applications in Radiology, Bio-IT.
  • Remote sensing applications interesting to various Govt. agencies and some commercial entities (agriculture, logistics).
  • Geo-located sensors in vehicles; driver behavior for insurance.
• Arrays for more than just images
  • Genome database: 2D array [ sample x base_pair ]
  • Timeseries data: [ anything x time ]
  • Graph Analytics: [ calling_phone_# x called_phone_# ]
• Traction on the “Scientific Warehouse”
  • Cost savings by centralizing infrastructure
  • Productivity advantages from cross-team collaboration

Strong Claim # 3: Tools and methodologies that have traditionally been restricted to “scientific” research will become central to “commercial” and “industrial” data processing.

Conclusions
• Dirt-bags told the Propeller-heads what they wanted …
  • Scalable, flexible array storage and data processing.
  • Platform for collaborative analytics on machine-data.
• Propeller-heads responded …
  • Not just SciDB!
  • MonetDB, Rasdaman, InfluxDB – all array DBMSs
  • Struggling to fit scientific data processing into other paradigms – SQL + HDFS.
• And not a moment too soon!
  • Shift of management approach in Big Science towards shared infrastructure (cost saving, productivity).
  • Multiple “commercial” consumers who need to use scientific tools and methods in their analysis.
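
Appendix sketch: chunked arrays and a window query

The first Short Tour query sums the cells of Example inside a coordinate box. The Python below is a rough, illustrative sketch of why a chunked array model suits that kind of query; it is not SciDB code and does not use any SciDB API. The names ChunkedArray, put and between are hypothetical, and the sketch assumes that the box bounds are inclusive and that each dimension is split into chunks of 1000 cells, as in the Example schema.

```python
from collections import defaultdict

CHUNK = 1000  # assumed chunk length per dimension, taken from the Example schema


class ChunkedArray:
    """Hypothetical sparse 2-D array stored as a dictionary of chunks."""

    def __init__(self, chunk=CHUNK):
        self.chunk = chunk
        # (chunk_x, chunk_j) -> {(x, j): value}
        self.chunks = defaultdict(dict)

    def put(self, x, j, value):
        """Store one cell in the chunk that owns its coordinates."""
        key = (x // self.chunk, j // self.chunk)
        self.chunks[key][(x, j)] = value

    def between(self, x_lo, j_lo, x_hi, j_hi):
        """Yield (x, j, value) for cells inside the inclusive box.

        Only chunks that overlap the box are visited; all others are skipped.
        """
        for cx in range(x_lo // self.chunk, x_hi // self.chunk + 1):
            for cj in range(j_lo // self.chunk, j_hi // self.chunk + 1):
                for (x, j), value in self.chunks.get((cx, cj), {}).items():
                    if x_lo <= x <= x_hi and j_lo <= j <= j_hi:
                        yield x, j, value


example = ChunkedArray()
example.put(400, 400, 1.0)     # outside the box
example.put(500, 500, 2.0)     # on the lower corner of the box
example.put(1200, 900, 3.0)    # inside the box
example.put(1500, 1500, 4.0)   # on the upper corner of the box
example.put(2500, 100, 5.0)    # in a chunk the box never touches

# Rough analogue of: SELECT SUM ( data ) FROM between ( Example, 500, 500, 1500, 1500 )
sum_data = sum(value for _, _, value in example.between(500, 500, 1500, 1500))
print(sum_data)  # 9.0
```

The caller states what it wants (a box and an aggregate), and the storage layout decides how to find it: the chunk holding cell (2500, 100) is never read. In a distributed setting the same idea would let each node scan only the chunks it owns.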