SciDB
An Open Source Data Base Project
by Michael Stonebraker (and others)

Outline
  Why science folks are unhappy with RDBMS
  How we plan to fix that
  The details

Why SciDB?
  “Big science” very unhappy with RDBMS
    Astronomy
    HEP
    Fusion
    Bio
    Remote sensing

Why?
  Experience of Sequoia 2000 (mid 1990s)
    Tried to use Postgres for science databases
    Failed badly……
  Main science data type is an array – horribly inefficient to simulate arrays on top of tables
  Required features absent (provenance, uncertainty, version control)
  SQL operations wrong (regrid – not join)

Why SciDB?
  Net result
    Mentality of “roll your own from the ground up” for every new science project
    Realization by the science community that this is long-term suicide
    Community wants to get behind something better
    Great commonality of needs among domains

A Little Context
  XLDB-1
    Genesis of the need
  Asilomar conference (March 2008)
    Small conference to generate requirements

A Little Context
  March 2008 – September 2008
    Initial design completed
    Fund raising
    Recruiting of initial team
    Detailed use cases specified

Our Partnership
  Science and high-end commercial folks
    Who will put up some resources
    And review design
  DBMS brain trust
    Who will design the system, oversee its construction, and perform needed research
  Non-profit company
    Which will manage the open source project
    And support the resulting system
    May need long term funding help

Partners – Science (We are recruiting more….)
  LSST astronomy project
    DBMS work co-ordinated by SLAC
  Pacific Northwest National Laboratory (PNNL)
    Various bio projects
  Lawrence Livermore National Laboratory
    Fusion projects
  UCSB
    Remote sensing

Partners -- DBMS
  Mike Stonebraker (MIT)
  Dave DeWitt (Wisconsin -> Microsoft)
  Jignesh Patel (Wisconsin)
  Jennifer Widom (Stanford)
  Dave Maier (Portland State)
  Stan Zdonik (Brown)
  Sam Madden (MIT)
  Ugur Cetintemel (Brown)
  Magda Balazinska (Washington)
  Mike Carey (UCI)

Partners -- Other
  E-Bay
  Vertica
  Microsoft
  LSST
  SLAC
  Will hit up NSF and DOE

The SciDB Data Model
  Nothing (e.g. Hadoop, Pig, Hive, …)?
    Most of you have schemas
    Hadoop is not a good starting point
      Slow
      No HA

The SciDB Data Model
  Tables?
    Makes a few of you happy
    Used by Sloan Sky Survey
  But
    PanStarrs (Alex Szalay) wants arrays and scalability

The SciDB Data Model
  Arrays?
    Superset of tables (tables with a primary key are a 1-D array)
    Makes HEP, remote sensing, astronomy, oceanography folks happy
  But
    Not biology and chemistry (who want networks and sequences)

The SciDB Data Model
  Multidimensional grids (non-uniform cells)
    Superset of arrays
    Makes solid modeling folks happy
  But
    Complex and slow

SciDB Data Model
  Nested multidimensional arrays
    Array values are a tuple of values and arrays
  Sightings (sid, details) [x, y, z, t]
  Objects (type, [sid]) [id]

Basic Arrays
  Positive integer dimensions, no gaps
  Bounded or unbounded

Enhanced Arrays
  “Shape” function
    Supports irregular boundary

Enhanced Arrays
  Co-ordinate systems
    User defined functions that map integers to something else
    E.g. Mercator
  Use dimension notation to access, e.g. A[17,36] or A{468.2, 917.6}

SciDB Query Language
  “Parse-tree” representation of array operations
  With a “binding” to:
    MatLab
    C++
    Python
    IDL
    There may be more….
  User extendable operations (Postgres-style)

Operations
  Standard relational ones (filter, join)
  Plus whatever you want (regrid, interpolate, Fourier transform, eigenvalues, …)
  Plus add your own (Postgres-style)
  We need science input here!!!
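
To make the regrid operation named in the slides concrete, here is a minimal sketch in plain Python. It is not SciDB code: the function name `regrid`, the `factor` parameter, and the choice of averaging each block are all assumptions made for illustration. It coarsens a 2-D array by replacing every non-overlapping block of cells with one cell, something that is natural on an array but awkward to express as a relational join.

```python
def regrid(cells, factor):
    """Coarsen a 2-D array by averaging non-overlapping factor x factor blocks.

    Hypothetical helper for illustration only; averaging is an assumed
    choice of aggregate, and this is not the SciDB operator itself.
    """
    rows, cols = len(cells), len(cells[0])
    if rows % factor or cols % factor:
        raise ValueError("array dimensions must be divisible by factor")
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            # Gather every cell of the block whose top-left corner is (r, c).
            block = [
                cells[i][j]
                for i in range(r, r + factor)
                for j in range(c, c + factor)
            ]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


# A 4 x 4 grid regridded down to 2 x 2.
grid = [
    [1.0, 3.0, 5.0, 7.0],
    [1.0, 3.0, 5.0, 7.0],
    [2.0, 2.0, 6.0, 6.0],
    [2.0, 2.0, 6.0, 6.0],
]
coarse = regrid(grid, 2)
print(coarse)  # [[2.0, 6.0], [2.0, 6.0]]
```

The output cell at position [0, 0] depends on a neighborhood of input cells defined purely by their integer dimension values, which is the kind of access pattern the array data model is meant to serve directly.
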
Environment and Storage
  Extendable grid (cloud) of Linux machines
  With built-in high availability and failover
  And built-in disaster recovery

In Situ Processing
  Operate on data without loading it
    Supported by a SciDB self-describing file format
    And some number of adaptors, e.g. HDF-5, NetCDF
    Or write your own

Storage Model
  Arrays are “chunked” in storage
    Chunk size can vary
  Chunks are partitioned across the grid
    Go for scalability to petabytes

Other Features Which Science Guys Want (These could be in RDBMS, but Aren’t)
  Uncertainty
    Data has error bars
    Which must be carried along in the computation (interval arithmetic)
    Will look at more sophisticated error models later

Other Features
  Provenance (lineage)
    What calibration generated the data
    What was the “cooking” algorithm
    In general – repeatability of data derivation
  Supported by a command log
    With query facilities (interesting research problem)
    And redo

Other Features
  Time travel
    Don’t fix errors by overwrite
    I.e. keep all of the data
    Supported by an extra array dimension (history)
  Spatial support
  Named versions
    Recalibration usually handled this way
    Supported by allocating an array for the new version and “diffing” against its parent

Other Features
  (Optionally) integration of the real time data capture system
    “Cooking” inside DBMS
    Makes provenance capture easier
    Sometimes important

Time Line
  Q4/08
    Start company, begin research activities
  Late 2009
    Demoware available
  Late 2010
    V1 ships

Project Organization (Build-it for real)
  CEO (Andy Palmer -- Vertica)
  Project management (Bobbi Heath -- Vertica)
  CTO (Stonebraker)

Project Organization (Design and Research)
  Overall co-ordination (Stonebraker, DeWitt)
  Storage and execution (Madden, Cetintemel)
  Query layer and semantics (Zdonik, Maier)
  Provenance (Widom, Patel)
  Resource management (Balazinska)
  Language bindings (Carey)

SciDB Has a Good Chance at Success
  Community realizes shared infrastructure is good
  “Lighthouse” customers
  Strong team
  Computation goes inside the DBMS
    Easier to share
    And reuse

How Can You Help?
  Get involved!!!!
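
As a backup illustration for the Uncertainty slide, the idea of carrying error bars through a computation with interval arithmetic can be sketched in a few lines of Python. This is an assumption-laden sketch, not SciDB's design: the `Interval` class and the `measured` helper are hypothetical names, and only addition, subtraction, and multiplication are shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A value known only to lie between lo and hi (hypothetical type)."""
    lo: float
    hi: float

    def __add__(self, other):
        # Smallest possible sum up to largest possible sum.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # Smallest result pairs our low end with the other's high end.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # Signs may differ, so take the extremes over all endpoint products.
        products = (
            self.lo * other.lo,
            self.lo * other.hi,
            self.hi * other.lo,
            self.hi * other.hi,
        )
        return Interval(min(products), max(products))


def measured(value, error):
    """Build an interval from a reading and its symmetric error bar."""
    return Interval(value - error, value + error)


a = measured(10.0, 0.5)   # Interval(lo=9.5, hi=10.5)
b = measured(4.0, 0.25)   # Interval(lo=3.75, hi=4.25)

print(a + b)  # Interval(lo=13.25, hi=14.75)
print(a - b)  # Interval(lo=5.25, hi=6.75)
print(a * b)  # Interval(lo=35.625, hi=44.625)
```

Each result interval is guaranteed to contain the true answer if the inputs contain the true readings, at the cost of bounds that can grow pessimistic over long computations; that is one reason the slides defer more sophisticated error models to later.
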