SciDB: An Open Source Data Base Project
by Michael Stonebraker (and others)

Outline
• Why science folks are unhappy with RDBMS
• How we plan to fix that
• The details

Why SciDB?
• "Big science" very unhappy with RDBMS
  • Astronomy
  • HEP
  • Fusion
  • Bio
  • Remote sensing
• Why?

Experience of Sequoia 2000 (mid 1990s)
• Tried to use Postgres for science databases
• Failed badly……
  • Main science data type is an array: horribly inefficient to simulate arrays on top of tables
  • Required features absent (provenance, uncertainty, version control)
  • SQL operations wrong (regrid, not join)

Why SciDB?
• Net result
  • Mentality of "roll your own from the ground up" for every new science project
  • Realization by the science community that this is long-term suicide
• Community wants to get behind something better
  • Great commonality of needs among domains

A Little Context
• XLDB-1
  • Genesis of the need
• Asilomar conference (March 2008)
  • Small conference to generate requirements

A Little Context
• March 2008 to September 2008
  • Initial design completed
  • Fund raising
  • Recruiting of initial team
  • Detailed use cases specified

Our Partnership
• Science and high-end commercial folks
  • Who will put up some resources
  • And review design
• DBMS brain trust
  • Who will design the system, oversee its construction, and perform needed research
• Non-profit company
  • Which will manage the open source project
  • And support the resulting system
  • May need long term funding help

Partners -- Science (We are recruiting more….)
• LSST
  • Astronomy project
  • DBMS work co-ordinated by SLAC
• Pacific Northwest National Laboratory (PNNL)
  • Various bio projects
• Lawrence Livermore National Laboratory
  • Fusion projects
• UCSB
  • Remote sensing

Partners -- DBMS
• Mike Stonebraker (MIT)
• Dave DeWitt (Wisconsin -> Microsoft)
• Jignesh Patel (Wisconsin)
• Jennifer Widom (Stanford)
• Dave Maier (Portland State)
• Stan Zdonik (Brown)
• Sam Madden (MIT)
• Ugur Cetintemel (Brown)
• Magda Balazinska (Washington)
• Mike Carey (UCI)

Partners -- Other
• E-Bay
• Vertica
• Microsoft
• LSST
• SLAC
• Will hit up NSF and DOE

The SciDB Data Model
• Nothing (e.g. Hadoop, Pig, Hive, …)?
  • Most of you have schemas
• Hadoop is not a good starting point
  • Slow
  • No HA

The SciDB Data Model
• Tables?
  • Makes a few of you happy
  • Used by Sloan Sky Survey
• But
  • PanStarrs (Alex Szalay) wants arrays and scalability

The SciDB Data Model
• Arrays?
  • Superset of tables (tables with a primary key are a 1-D array)
  • Makes HEP, remote sensing, astronomy, oceanography folks happy
• But
  • Not biology and chemistry (who want networks and sequences)

The SciDB Data Model
• Multidimensional grids
  • Superset of arrays (non-uniform cells)
  • Makes solid modeling folks happy
• But
  • Complex and slow

SciDB Data Model
• Nested multidimensional arrays
• Array values are a tuple of values and arrays
    Sightings (sid, details) [x, y, z, t]
    Objects (type, [sid]) [id]

Basic Arrays
• Positive integer dimensions, no gaps
• Bounded or unbounded

Enhanced Arrays
• "Shape" function
  • Supports irregular boundary

Enhanced Arrays
• Co-ordinate systems
  • User defined functions that map integers to something else
  • E.g. Mercator
• Use dimension notation to access, e.g.
  • A[17,36] or
  • A{468.2, 917.6}

SciDB Query Language
• "Parse-tree" representation of array operations
• With a "binding" to:
  • MatLab
  • C++
  • Python
  • IDL
  • There may be more….
• User extendable operations (Postgres-style)

Operations
• Standard relational ones (filter, join)
• Plus whatever you want (regrid, interpolate, Fourier transform, eigenvalues, …)
• Plus add your own (Postgres-style)
• We need science input here!!!

Environment and Storage
• Extendable grid (cloud) of Linux machines
• With built-in high availability and failover
• And built-in disaster recovery

In Situ Processing
• Operate on data without loading it
• Supported by a SciDB self-describing file format
  • And some number of adaptors, e.g. HDF-5, NetCDF
  • Or write your own

Storage Model
• Arrays are "chunked" in storage
  • Chunk size can vary
• Chunks are partitioned across the grid
• Go for scalability to petabytes

Other Features Which Science Guys Want (These could be in RDBMS, but Aren't)
• Uncertainty
  • Data has error bars
  • Which must be carried along in the computation (interval arithmetic)
  • Will look at more sophisticated error models later

Other Features
• Provenance (lineage)
  • What calibration generated the data
  • What was the "cooking" algorithm
  • In general: repeatability of data derivation
• Supported by a command log
  • With query facilities (interesting research problem)
  • And redo

Other Features
• Time travel
  • Don't fix errors by overwrite
  • I.e. keep all of the data
  • Supported by an extra array dimension (history)
• Spatial support
• Named versions
  • Recalibration usually handled this way
  • Supported by allocating an array for the new version and "diffing" against its parent

Other Features
• (Optionally) integration of the real time data capture system
  • "Cooking" inside DBMS
  • Makes provenance capture easier
  • Sometimes important

Time Line
• Q4/08: start company, begin research activities
• Late 2009: Demoware available
• Late 2010: V1 ships

Project Organization (Build-it for real)
• CEO (Andy Palmer -- Vertica)
• Project management (Bobbi Heath -- Vertica)
• CTO (Stonebraker)

Project Organization (Design and Research)
• Overall co-ordination (Stonebraker, DeWitt)
• Storage and execution (Madden, Cetintemel)
• Query layer and semantics (Zdonik, Maier)
• Provenance (Widom, Patel)
• Resource management (Balazinska)
• Language bindings (Carey)

SciDB Has a Good Chance at Success
• Community realizes shared infrastructure is good
• "Lighthouse" customers
• Strong team
• Computation goes inside the DBMS
  • Easier to share
  • And reuse

How Can You Help?
• Get involved!!!!
