SciDB: An Open Source Data Base Project
by Michael Stonebraker (and others)
Outline
Why science folks are unhappy with RDBMS
How we plan to fix that
The details
Why SciDB?
“Big science” very unhappy with RDBMS
  Astronomy
  HEP
  Fusion
  Bio
  Remote sensing
Why?
Experience of Sequoia 2000 (mid 1990s)
Tried to use Postgres for science databases
Failed badly...
  Main science data type is an array; horribly inefficient to simulate arrays on top of tables
  Required features absent (provenance, uncertainty, version control)
  SQL operations wrong (regrid, not join)
Why SciDB?
Net result
  Mentality of “roll your own from the ground up” for every new science project
  Realization by the science community that this is long-term suicide
Community wants to get behind something better
  Great commonality of needs among domains
A Little Context
XLDB-1
  Genesis of the need
Asilomar conference (March 2008)
  Small conference to generate requirements
A Little Context
March 2008 – September 2008
  Initial design completed
  Fund raising
  Recruiting of initial team
  Detailed use cases specified
Our Partnership
Science and high-end commercial folks
  Who will put up some resources
  And review design
DBMS brain trust
  Who will design the system, oversee its construction, and perform needed research
Non-profit company
  Which will manage the open source project
  And support the resulting system
  May need long term funding help
Partners – Science
(We are recruiting more...)
LSST astronomy project
  DBMS work co-ordinated by SLAC
Pacific Northwest National Laboratory (PNNL)
  Various bio projects
Lawrence Livermore National Laboratory
  Fusion projects
UCSB
  Remote sensing
Partners -- DBMS
Mike Stonebraker (MIT)
Dave DeWitt (Wisconsin -> Microsoft)
Jignesh Patel (Wisconsin)
Jennifer Widom (Stanford)
Dave Maier (Portland State)
Stan Zdonik (Brown)
Sam Madden (MIT)
Ugur Cetintemel (Brown)
Magda Balazinska (Washington)
Mike Carey (UCI)
Partners -- Other
E-Bay
Vertica
Microsoft
LSST
SLAC
Will hit up NSF and DOE
The SciDB Data Model
Nothing (e.g. Hadoop, Pig, Hive, …)?
  Most of you have schemas
Hadoop is not a good starting point
  Slow
  No HA
The SciDB Data Model
Tables?
  Makes a few of you happy
  Used by Sloan Sky Survey
But
  PanStarrs (Alex Szalay) wants arrays and scalability
The SciDB Data Model
Arrays?
  Superset of tables (tables with a primary key are a 1-D array)
  Makes HEP, remote sensing, astronomy, oceanography folks happy
But
  Not biology and chemistry (who want networks and sequences)
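The "tables with a primary key are a 1-D array" point can be illustrated with a minimal Python sketch. The table contents and the column names (`id`, `mag`) are made up for illustration and are not part of the SciDB design.

```python
# Hypothetical table: rows identified by an integer primary key.
rows = [
    {"id": 1, "mag": 14.2},
    {"id": 2, "mag": 15.1},
    {"id": 3, "mag": 13.7},
]

# Same data viewed as a 1-D array: the primary key becomes the single
# dimension, and the remaining columns become the cell value (a tuple).
array_1d = {row["id"]: (row["mag"],) for row in rows}

# A primary-key lookup is now just an array access.
print(array_1d[2])
```

The reverse does not hold in general, which is the sense in which arrays are a superset: a cell of a 4-D array has no natural single-column key.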
The SciDB Data Model
Multidimensional grids
  Superset of arrays (non-uniform cells)
  Makes solid modeling folks happy
But
  Complex and slow
SciDB Data Model
Nested multidimensional arrays
Array values are a tuple of values and arrays
Sightings (sid, details) [x, y, z, t]
Objects (type, [sid]) [id]
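One way to read the two declarations above is sketched below in Python. This is only an illustrative model, not SciDB syntax: the class names, the sample cells, and the helper `sightings_of` are assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Sighting:
    # Cell value of Sightings: a tuple (sid, details).
    sid: int
    details: str


@dataclass
class AstroObject:
    # Cell value of Objects: a tuple (type, [sid]) whose second
    # component is itself a nested 1-D array of sighting ids.
    type: str
    sids: list = field(default_factory=list)


# Sightings (sid, details) [x, y, z, t]: cells addressed by a 4-D index.
sightings = {
    (3, 4, 0, 1): Sighting(sid=101, details="bright"),
    (3, 5, 0, 2): Sighting(sid=102, details="faint"),
}

# Objects (type, [sid]) [id]: cells addressed by a 1-D index.
objects = {
    1: AstroObject(type="star", sids=[101, 102]),
}


def sightings_of(obj_id):
    """Return the 4-D indexes of all sightings belonging to one object."""
    wanted = set(objects[obj_id].sids)
    return sorted(idx for idx, s in sightings.items() if s.sid in wanted)


print(sightings_of(1))
```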
Basic Arrays
Positive integer dimensions, no gaps
Bounded or unbounded
Enhanced Arrays
“Shape” function
  Supports irregular boundary
Enhanced Arrays
Co-ordinate systems
  User defined functions that map integers to something else
  E.g. Mercator
  Use dimension notation to access, e.g.
    A[17,36]
    or
    A{468.2, 917.6}
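A rough Python sketch of the two access paths follows. It assumes, purely for illustration, a linear co-ordinate system (`scaled`, with a made-up origin and step) under which `A{468.2, 917.6}` happens to land in cell `A[17,36]`; the slide does not say the two examples name the same cell. The sketch also uses the co-ordinate-to-integer direction of the mapping, since that is what a lookup needs.

```python
class EnhancedArray:
    """Integer-indexed cells plus a user-defined co-ordinate mapping."""

    def __init__(self, cells, to_index):
        self.cells = cells        # dict keyed by integer index tuples
        self.to_index = to_index  # user-defined: co-ordinates -> integers

    def __getitem__(self, index):
        # Integer access, in the spirit of A[17,36].
        return self.cells[index]

    def at(self, *coords):
        # Co-ordinate access, in the spirit of A{468.2, 917.6}.
        return self.cells[self.to_index(*coords)]


def scaled(x, y, origin=(400.0, 773.0), step=4.0):
    # Hypothetical co-ordinate system: uniform grid with a fixed step.
    return (int((x - origin[0]) // step), int((y - origin[1]) // step))


A = EnhancedArray({(17, 36): "cell value"}, scaled)
print(A[17, 36])
print(A.at(468.2, 917.6))
```

A Mercator projection would replace `scaled` with a non-linear function; the array itself would not change.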
SciDB Query Language
“Parse-tree” representation of array operations
With a “binding” to:
  MatLab
  C++
  Python
  IDL
  There may be more...
User extendable operations (Postgres-style)
Operations
Standard relational ones (filter, join)
Plus whatever you want (regrid, interpolate, Fourier transform, eigenvalues, …)
Plus add your own (Postgres-style)
We need science input here!!!
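To make "regrid" concrete, here is a minimal pure-Python sketch. It assumes regrid means coarsening a 2-D grid by averaging non-overlapping blocks; the slides do not define the operation, so both the semantics and the function signature are assumptions.

```python
def regrid(grid, factor):
    """Coarsen a 2-D grid by averaging factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            # Collect the cells of one block, clipping at the grid edge.
            block = [
                grid[i][j]
                for i in range(r, min(r + factor, rows))
                for j in range(c, min(c + factor, cols))
            ]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


fine = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
coarse = regrid(fine, 2)
print(coarse)
```

Each output cell depends on a neighbourhood of input cells defined by position, which is why this is awkward to phrase as a relational join.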
Environment and Storage
Extendable grid (cloud) of Linux machines
With built-in high availability and failover
And built-in disaster recovery
In Situ Processing
Operate on data without loading it
Supported by a SciDB self-describing file format
  And some number of adaptors, e.g. HDF-5, NetCDF
  Or write your own
Storage Model
Arrays are “chunked” in storage
  Chunk size can vary
Chunks are partitioned across the grid
Go for scalability to petabytes
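The chunk-then-partition idea can be sketched in a few lines of Python. The fixed chunk shape and the sum-modulo placement rule are simplifying assumptions made here; the slides say chunk size can vary and do not specify how chunks are assigned to nodes.

```python
def chunk_of(cell, chunk_shape):
    """Map a cell index to the index of the chunk that contains it."""
    return tuple(i // size for i, size in zip(cell, chunk_shape))


def node_of(chunk, num_nodes):
    """Assign a chunk to a grid node (hypothetical placement rule)."""
    return sum(chunk) % num_nodes


cell = (17, 36)
chunk = chunk_of(cell, chunk_shape=(10, 10))
node = node_of(chunk, num_nodes=3)
print(chunk, node)
```

Cells that are close in array space fall into the same chunk, so a neighbourhood operation usually touches one node rather than many.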
Other Features Which Science Guys Want
(These could be in RDBMS, but aren’t)
Uncertainty
  Data has error bars
  Which must be carried along in the computation (interval arithmetic)
  Will look at more sophisticated error models later
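Carrying error bars through a computation with interval arithmetic can be sketched as follows. The `Interval` class is a minimal illustration written for this purpose, not a SciDB type, and it covers only addition, subtraction, and multiplication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # Worst case: smallest minus largest, largest minus smallest.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # The extremes of a product lie at the endpoint combinations.
        products = [
            self.lo * other.lo, self.lo * other.hi,
            self.hi * other.lo, self.hi * other.hi,
        ]
        return Interval(min(products), max(products))


a = Interval(9.0, 11.0)  # a measurement of 10 with error bar 1
b = Interval(1.5, 2.5)   # a measurement of 2 with error bar 0.5
print(a + b, a - b, a * b)
```

The result intervals are guaranteed bounds but can be pessimistic, which is one reason more sophisticated error models are worth a later look.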
Other Features
Provenance (lineage)
  What calibration generated the data
  What was the “cooking” algorithm
  In general – repeatability of data derivation
Supported by a command log
  With query facilities (interesting research problem)
  And redo
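A toy Python version of a command log with a lineage query and redo is sketched below. The operation names (`load`, `calibrate`), the log layout, and the class `CommandLog` are all assumptions for illustration; the slides describe the idea only at this level.

```python
# Hypothetical operations that the log knows how to re-execute.
OPS = {
    "load": lambda values: list(values),
    "calibrate": lambda data, gain: [x * gain for x in data],
}


def resolve(store, args):
    # Arguments that name an existing array are replaced by its contents.
    return [store[a] if isinstance(a, str) and a in store else a for a in args]


class CommandLog:
    def __init__(self):
        self.entries = []  # append-only: (output_name, op_name, args)

    def run(self, store, output, op, *args):
        store[output] = OPS[op](*resolve(store, args))
        self.entries.append((output, op, args))

    def lineage(self, output):
        """Query facility: which commands contributed to `output`?"""
        found, wanted = [], {output}
        for out, op, args in reversed(self.entries):
            if out in wanted:
                found.append((out, op, args))
                wanted.update(a for a in args if isinstance(a, str))
        return list(reversed(found))

    def redo(self):
        """Re-derive every array from scratch by replaying the log."""
        store = {}
        for out, op, args in self.entries:
            store[out] = OPS[op](*resolve(store, args))
        return store


store, log = {}, CommandLog()
log.run(store, "raw", "load", (1.0, 2.0, 3.0))
log.run(store, "cooked", "calibrate", "raw", 2.0)
print(log.lineage("cooked"))
print(log.redo() == store)
```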
Other Features
Time travel
  Don’t fix errors by overwrite
  I.e. keep all of the data
  Supported by an extra array dimension (history)
Spatial support
Named versions
  Recalibration usually handled this way
  Supported by allocating an array for the new version and “diffing” against its parent
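Both mechanisms can be sketched in Python. The class `TimeTravelArray`, the integer clock used as the history dimension, and the dict-based diff are assumptions chosen for brevity, not the planned storage format.

```python
class TimeTravelArray:
    """2-D array with an extra history dimension t; nothing is overwritten."""

    def __init__(self):
        self.cells = {}  # (x, y, t) -> value
        self.clock = 0

    def write(self, x, y, value):
        self.clock += 1
        self.cells[(x, y, self.clock)] = value
        return self.clock

    def read(self, x, y, as_of=None):
        as_of = self.clock if as_of is None else as_of
        times = [t for (cx, cy, t) in self.cells
                 if (cx, cy) == (x, y) and t <= as_of]
        if not times:
            raise KeyError((x, y, as_of))
        return self.cells[(x, y, max(times))]


def diff_against_parent(parent, child):
    """Store only the cells of the new version that differ from the parent."""
    return {k: v for k, v in child.items() if parent.get(k) != v}


def materialize(parent, delta):
    """Rebuild the full new version from the parent plus its diff."""
    merged = dict(parent)
    merged.update(delta)
    return merged


a = TimeTravelArray()
t1 = a.write(0, 0, 5.0)
t2 = a.write(0, 0, 5.5)  # a correction, appended rather than overwritten
print(a.read(0, 0), a.read(0, 0, as_of=t1))

parent = {(0, 0): 1.0, (0, 1): 2.0}
recalibrated = {(0, 0): 1.0, (0, 1): 2.1}
delta = diff_against_parent(parent, recalibrated)
print(delta)
```

The diff sketch handles changed and added cells only; cells deleted in the new version would need an explicit marker.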
Other Features
(Optionally) integration of the real time data capture system
  “Cooking” inside DBMS
  Makes provenance capture easier
  Sometimes important
Time Line
Q4/08
  Start company, begin research activities
Late 2009
  Demoware available
Late 2010
  V1 ships
Project Organization
(Build-it for real)
CEO (Andy Palmer -- Vertica)
Project management (Bobbi Heath -- Vertica)
CTO (Stonebraker)
Project Organization
(Design and Research)
Overall co-ordination (Stonebraker, DeWitt)
Storage and execution (Madden, Cetintemel)
Query layer and semantics (Zdonik, Maier)
Provenance (Widom, Patel)
Resource management (Balazinska)
Language bindings (Carey)
SciDB Has a Good Chance at Success
Community realizes shared infrastructure is good
“Lighthouse” customers
Strong team
Computation goes inside the DBMS
  Easier to share
  And reuse
How Can You Help?
Get involved!!!!