Scientific Databases: the story behind the scenes Martin Kersten Milena Ivanova

Martin Kersten
Milena Ivanova
Scientific Databases:
the story behind the scenes
M.Kersten Mar 2010
DIR Edinburgh
Departure for a journey
• CWI Database Architecture Group
• Core business:
• To research efficient and effective database technology
• To deploy this technology in real-life application settings
• To disseminate this knowledge as open-source software
• Key research issues
• What is the ultimate (virtual) machine architecture and
software stack for database processing?
The Big Data Bang
Outline
• Departure for a journey
• Mapping unknown territory
• Crossing the Great Divide
• Stepping stone 1: Multimedia Dimension
• Stepping stone 2: Geometric Dimension
• Stepping stone 3: Lineage Dimension
• Stepping stone 4: Heterogeneous Databases
• Stepping stone 5: Semantic Search
• Stepping stone 6: Wireless sensor databases
• Stepping stone 7: Distributed Databases
• Arrival and outlook
• SciDB and SciLens ambitions
• Teaming up and making it a success
A project to make a map of a large part of the Universe
• 230 million object images
• 1 million spectra
• 4TB catalog data
• 9TB images
SkyServer provides public access to SDSS for astronomers, students,
and the wider public
SkyServer Schema
446 columns
>370 million rows
Vertical fragment of 100+
popular columns
Materialized join of
Photo and Spectra
Initial exploration
Mapping unknown territory
[Figure: science domains (Astronomy, Geophysics, Biosciences, Neuroscience, …) and the data dimensions they involve (Modelling (Atlas), Annotations, Features Space, Geometric Mapping, Multimedia Images, …)]
One size fits all?
[Figure: database systems positioned by scale and data type. Axis labels: Pico scale, Mega scale; Structured, semi-structured, documents, images. Systems shown: Oracle, MS SQL Server, DB2, Vertica, MonetDB, PostgreSQL, MySQL, MariaDB, SQLite, NoSQL, MongoDB, LucidDB]
We have to stand the storm
Stepping stone 1: Multimedia Dimension
• Storage challenges:
• Large volumes (>Tbyte, >Pbyte) of raw data
• Partitioning based on image, video segmentation
• Indexing based on feature vectors
• Query challenges:
• Proximity and probability based search
• CPU intensive, user defined predicates
• Content-based information retrieval
Stepping stone 1: Multimedia Dimension
• The database consists of 100,000 images.
• From each image we extract 25 patches
• For each patch a 14-dimensional feature vector is derived,
giving 2,500,000 feature vectors in total
• Challenge: find similar images based on Euclidean distance
with sub-second response time.
• Solution: novel database algorithms to solve k-nearest
neighbours (k-NN) search
• Lessons: start from generative models.
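As a point of reference for the similarity search described above, a brute-force k-NN scan over feature vectors can be sketched as follows. This is only a baseline sketch, not the novel algorithm the slide refers to. The function names `euclidean` and `knn` and the toy 3-dimensional data are assumptions made for illustration (the slide uses 14-dimensional vectors).

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(query, vectors, k):
    """Return the k (distance, index) pairs closest to `query`.

    Brute-force scan: one distance computation per stored vector.
    """
    scored = ((euclidean(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Toy data: four 3-dimensional vectors (hypothetical, for illustration only).
patches = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 2.0, 0.0),
    (5.0, 5.0, 5.0),
]
nearest = knn((0.9, 0.1, 0.0), patches, k=2)
nearest_ids = [index for _, index in nearest]
```

A full scan of 2,500,000 vectors per query is what makes the sub-second target hard, which is why an index or a different model is needed.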
Stepping stone 1: Multimedia Dimension
• Alternative scheme: determine the probability that an image can
be generated with a limited number of Gaussian mixtures
• Fix a limited number of GMM components and use an Expectation
Maximization algorithm to fit the model over the image
• Search similar images by comparison of the GMM model
parameters
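The fitting step above can be illustrated with a minimal Expectation Maximization loop. This sketch makes several simplifying assumptions that are not in the slides: the data are 1-dimensional scalars rather than 14-dimensional feature vectors, the initial means are supplied by the caller, and the names `gaussian_pdf` and `fit_gmm` are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm(data, means, iterations=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM, starting from the given means."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300  # guard against underflow to zero
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0.0:
                continue  # component received no mass; leave it unchanged
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(min_var, var)
    return weights, means, variances


# Toy data clustered around 0 and around 10 (hypothetical values).
samples = [-0.5, 0.0, 0.3, 0.2, -0.1, 9.6, 10.0, 10.4, 10.1, 9.9]
weights, means, variances = fit_gmm(samples, means=[1.0, 8.0])
```

The fitted `(weights, means, variances)` triple is the compact description of an image that would be stored and compared, instead of the raw feature vectors.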
Probabilistic Image Dimension
• Query:
• Which of the models is most likely to generate these 24
samples?
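The query above amounts to scoring the samples under every stored model and keeping the best one. A minimal sketch, assuming 1-dimensional mixtures stored as lists of `(weight, mean, variance)` triples. The model names, the parameter values, and the functions `log_gmm_density` and `most_likely_model` are hypothetical.

```python
import math


def log_gmm_density(x, model):
    """Log density of scalar x under a mixture [(weight, mean, var), ...]."""
    terms = [
        math.log(w) - 0.5 * math.log(2.0 * math.pi * var)
        - (x - mean) ** 2 / (2.0 * var)
        for w, mean, var in model
    ]
    top = max(terms)  # log-sum-exp keeps the computation numerically stable
    return top + math.log(sum(math.exp(t - top) for t in terms))


def most_likely_model(samples, models):
    """Return the name of the model with the highest total log-likelihood."""
    def score(name):
        return sum(log_gmm_density(x, models[name]) for x in samples)
    return max(models, key=score)


# Two hypothetical image models, each a two-component mixture.
models = {
    "image_a": [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)],
    "image_b": [(0.5, 20.0, 1.0), (0.5, 25.0, 1.0)],
}
best = most_likely_model([0.2, 4.8, 5.1, -0.3], models)
```

Working in log space matters here: multiplying many small densities directly would underflow long before all samples are scored.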
Stepping stone 2: Geometric Dimension
• Any geometric abstraction of reality provides a good
navigational map
• Database storage and indexing support for 2D is mature
• R-trees and Quad-trees
• Commercial database vendors do ‘not like them’
• Open research issue is to support 2D query embedding
• Scaling out towards 3 and 4 dimensions and temporal support
• Examples: researched extensively in Geographical
Information Systems; Google Maps and OpenGIS are
omnipresent
• Lessons: avoid an abundance of reference models; baroque
data structures do not necessarily scale
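To make the 2D indexing mentioned above concrete, here is a minimal point quadtree with a rectangular range query. It is a teaching sketch, not the structure used by any particular DBMS. The class name `QuadTree`, the node capacity, and the depth limit are assumptions made for the example.

```python
class QuadTree:
    """Point quadtree over the half-open square [x0, x1) x [y0, y1)."""

    MAX_DEPTH = 16  # stop splitting here so duplicate points cannot recurse forever

    def __init__(self, x0, y0, x1, y1, capacity=4, depth=0):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.depth = depth
        self.points = []
        self.children = None  # four sub-quadrants once this node is split

    def insert(self, x, y):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False  # point lies outside this node
        if self.children is None:
            if len(self.points) < self.capacity or self.depth >= self.MAX_DEPTH:
                self.points.append((x, y))
                return True
            self._split()
        return any(child.insert(x, y) for child in self.children)

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        d = self.depth + 1
        self.children = [
            QuadTree(x0, y0, mx, my, self.capacity, d),
            QuadTree(mx, y0, x1, my, self.capacity, d),
            QuadTree(x0, my, mx, y1, self.capacity, d),
            QuadTree(mx, my, x1, y1, self.capacity, d),
        ]
        old_points, self.points = self.points, []
        for px, py in old_points:
            self.insert(px, py)  # now routed to the matching child

    def query(self, qx0, qy0, qx1, qy1):
        """Return all points with qx0 <= x < qx1 and qy0 <= y < qy1."""
        x0, y0, x1, y1 = self.bounds
        if qx1 <= x0 or qx0 >= x1 or qy1 <= y0 or qy0 >= y1:
            return []  # query window does not overlap this node
        found = [(x, y) for x, y in self.points
                 if qx0 <= x < qx1 and qy0 <= y < qy1]
        for child in self.children or []:
            found.extend(child.query(qx0, qy0, qx1, qy1))
        return found


tree = QuadTree(0, 0, 100, 100, capacity=2)
for point in [(10, 10), (20, 20), (80, 80), (15, 85), (60, 40)]:
    tree.insert(*point)
hits = sorted(tree.query(0, 0, 50, 50))
```

The pruning test at the top of `query` is where the index pays off: whole quadrants that do not overlap the window are skipped without touching their points.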
Stepping stone 3: Lineage Dimension
• The problem encountered in many scientific databases is to
ensure data lineage, the ability to travel back in time to
understand, redo and judge the derivations.
• How to keep track of the complete context?
• Data, software, parameter settings,…
• How to redo part of the analysis?
• How to store and remember the lineage trails?
• Example: AstroWise project in Groningen keeps track of a
complete workflow for telescope data analysis in a large
Oracle database. All derivations are 5-line Python programs.
• Lesson: don’t be afraid of storage cost, be an accountant
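The bookkeeping described above (data, software, and parameter settings for every derivation) can be sketched as a small append-only store. This is an illustrative sketch, not the AstroWise design. The class `LineageStore`, the helper `fingerprint`, and the two toy derivation steps are hypothetical.

```python
import hashlib
import json


def fingerprint(value):
    """Stable SHA-256 hash of a JSON-serialisable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class LineageStore:
    """Keeps every data item plus a record of how it was derived."""

    def __init__(self):
        self.data = {}     # data id -> value (raw inputs and derived outputs)
        self.records = {}  # output id -> derivation record

    def put(self, value):
        data_id = fingerprint(value)
        self.data[data_id] = value
        return data_id

    def derive(self, func, input_id, params, software_version):
        output_id = self.put(func(self.data[input_id], **params))
        self.records[output_id] = {
            "function": func.__name__,
            "software_version": software_version,
            "params": params,
            "input_id": input_id,
        }
        return output_id

    def trail(self, data_id):
        """Walk back from a derived item towards its raw input."""
        steps, seen = [], set()
        while data_id in self.records and data_id not in seen:
            seen.add(data_id)  # guards against a step whose output equals its input
            steps.append(self.records[data_id])
            data_id = self.records[data_id]["input_id"]
        return steps


def calibrate(pixels, offset):
    return [p - offset for p in pixels]


def threshold(pixels, cutoff):
    return [p for p in pixels if p >= cutoff]


store = LineageStore()
raw = store.put([12, 15, 30, 7])
calibrated = store.derive(calibrate, raw, {"offset": 5}, "v1.0")
bright = store.derive(threshold, calibrated, {"cutoff": 10}, "v1.0")
steps = [record["function"] for record in store.trail(bright)]
```

Nothing is ever overwritten, which is the "accountant" attitude from the lesson: storage is spent so that any result can be traced and recomputed.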
Stepping stone 4: Heterogeneous Databases
• A key problem is to share heterogeneous information
• Use commonly approved vocabulary and standard syntax
• XML is the language inter-galactica for self-descriptive
data and its exchange between software systems
• RDF claims to be the next king
• The database community was actively working on XML,
XQuery, and XUpdate database engines, but it is not easy!
• Challenges: how to scale to large XML stores? How to
efficiently search components? How to realize structural
information retrieval?
• RDF world brings in graph algorithms
• Lessons: science is done, jewels are captured by bandits
Database and Informatics Working Group
FBIRN 2005 – David Keator
[Figure: fBIRN pipeline “big picture”. Labels: MR scanner; scanner- or software-specific file formats; XML-based image header; XML-based events file; image preprocessing; event analysis]
Stepping stone 5: Semantic search
• Ontology integration is one of the most pressing challenges
for the semantic web to take off.
• Integration of technology with databases is still immature.
• RDF and OWL are the leading paradigms, SPARQL is an
attempt to bridge the gap between traditional database
management and semantic web technology.
• Lessons: not a technological issue, but an educational and
cultural issue
• http://e-culture.multimedian.nl/demo/search
Stepping stone 6: Sensor Databases
• Database management functionality can be downscaled to
the level of small sensor-enabled devices. They can form ad hoc
networks and provide a straightforward SQL interface
for aggregation. The focus is on network-based aggregation
under severe energy limitations.
• Embedded database systems are not up to the job. Positive
case studies include TinyDB on TinyOS (Berkeley)
• The DataCell project at CWI (and Philips) aims to provide
a more expressive query language and application
interface.
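Network-based aggregation, as mentioned above, means each node merges its own reading with the partial results of its children and forwards a single small record to its parent. A minimal sketch for an average over a routing tree follows. The tree layout, the readings, and the function name `aggregate_avg` are hypothetical, and this simulates the idea in one process rather than showing TinyDB code.

```python
def aggregate_avg(node, readings, children):
    """Return the partial state (sum, count) for the subtree rooted at `node`.

    Each node combines its own reading with its children's partial states,
    so only one small record travels up each network link.
    """
    total, count = readings[node], 1
    for child in children.get(node, []):
        child_total, child_count = aggregate_avg(child, readings, children)
        total += child_total
        count += child_count
    return total, count


# Hypothetical routing tree: node 0 is the root next to the base station.
children = {0: [1, 2], 1: [3, 4]}
readings = {0: 20.0, 1: 22.0, 2: 19.0, 3: 25.0, 4: 24.0}
total, count = aggregate_avg(0, readings, children)
average = total / count
```

Shipping `(sum, count)` instead of raw readings keeps message sizes constant regardless of subtree size, which is what saves radio energy.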
Research World Perspective
[Figure: evolution from Past to Future. Labels: sensor cluster, mobile, Semantic Sensors, integrated management, distributed management, stationary, distributed, PC-less, sensor net, AmbientDB]
Stepping stone 7: MR/DDBMS
• HPC … Grids …. Clouds …
• Grids target high-performance computing, with a
focus on Authentication-Authorization-Access and data
shipping over wide-area networks.
• Map-reduce technology is a re-invention of re-scaled
distributed database technology and distributed
programming.
• Data distribution, replication, and parallel query processing
have been well studied over the last three decades!
• Lessons: application programmers are infected by “not-written-by-me” hype bacteria
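The overlap claimed above can be seen in a small example: a map-reduce sum per key has the same shape as a parallel GROUP BY, with a local aggregate per partition followed by a merge of partial results. The sketch below runs in a single process. The partition contents and the function names `map_partition` and `merge` are hypothetical.

```python
from functools import reduce


def map_partition(rows):
    """'Map' step: local GROUP BY key, SUM(value) on one data partition."""
    partial = {}
    for key, value in rows:
        partial[key] = partial.get(key, 0) + value
    return partial


def merge(left, right):
    """'Reduce' step: combine two partial aggregates into one."""
    merged = dict(left)
    for key, value in right.items():
        merged[key] = merged.get(key, 0) + value
    return merged


# Three hypothetical partitions of (key, value) rows.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5)],
]
result = reduce(merge, map(map_partition, partitions), {})
```

Because `merge` is associative and commutative, the partial results can be combined in any order or in a tree, which is the property both parallel databases and map-reduce frameworks rely on.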
MonetDB in the large
• MonetDB/Map-reduce
• Pure map-reduce approach driven by query streams
leading to self-organising distributed database.
• MonetDB/Octopus
• Dynamic partial replication of databases with economic
model for reallocation and recycler technology
• MonetDB/Datacyclotron
• Let the database hotset flow like a stream of particles
through a large, fast ring of connected machines, e.g. a
data collider
Get our hands dirty
Toys, Tools & Techniques
The MonetDB product family
[Figure: end-user applications connect to the MonetDB kernel over the MAPI protocol. Interfaces shown: SQL, XQuery, JDBC, ODBC, Python, Perl, PHP, RoR, C-mapi lib]
The MonetDB Software Stack
[Figure: layered stack. Labels: XQuery, SQL 03, SQL/XML, Open-GIS, SOAP, compile, Optimizers, MonetDB 4, MonetDB 5, MonetDB kernel]
An advanced column-oriented DBMS
Extensions (GIS):
• Orthogonal extension of SQL03
• Clear computational semantics
• Minimal extension to MonetDB
MonetDB Recycler Architecture
function user.s1_2(A0:date, ...):void;
X5 := sql.bind("sys","lineitem",...);
X10 := algebra.select(X5,A0);
X12 := sql.bindIdx("sys","lineitem",...);
X15 := algebra.join(X10,X12);
X25 := mtime.addmonths(A1,A2);
...
[Figure: recycler architecture inside the MonetDB Server. Labels: SQL, XQuery, MAL, Tactical Optimizer, Recycler Optimizer, MonetDB Kernel, Run-time Support, Admission & Eviction, Recycle Pool]
Source: An Architecture for Recycling Intermediates, M. Ivanova, M. L., SIGMOD'09, Providence, RI, 30/06/2009
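The idea of recycling intermediates can be sketched as a cache of operator results keyed by the instruction and its arguments, with an admission and an eviction rule. The policies used here (admit everything, evict the least recently used entry) are assumptions chosen for brevity and are not taken from the slides. The class `RecyclePool` and the operator `select_ge` are hypothetical.

```python
from collections import OrderedDict


class RecyclePool:
    """Cache of operator results keyed by (operator name, arguments)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pool = OrderedDict()
        self.hits = 0

    def run(self, name, func, *args):
        key = (name, args)  # arguments must be hashable
        if key in self.pool:
            self.hits += 1
            self.pool.move_to_end(key)  # mark as most recently used
            return self.pool[key]
        result = func(*args)
        self.pool[key] = result  # admission: keep every new intermediate
        if len(self.pool) > self.capacity:
            self.pool.popitem(last=False)  # eviction: drop least recently used
        return result


def select_ge(column, low):
    """Positions of the values in `column` that are >= low."""
    return tuple(i for i, v in enumerate(column) if v >= low)


prices = (5, 12, 7, 30, 18)  # a hypothetical column
pool = RecyclePool(capacity=2)
first = pool.run("select_ge", select_ge, prices, 10)
second = pool.run("select_ge", select_ge, prices, 10)  # served from the pool
```

In a column store every operator materialises its result anyway, so keeping that result around for a later query with the same instruction costs memory but no extra computation.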
SciDB and SciLens projects
• Design and implement a database management system better
geared at the requirements of scientific applications
• SciDB vision (http://www.scidb.org)
• Array datamodel is missing
• Distributed, map-reduce processing from the start
• No-cost loading of data
• … redo all the hard work from the ground up
• SciLens
• Multi-paradigm software layer
• Database summarisation is the key
• … build on the shoulders of the MonetDB team
Teaming up and making it a success
Crossing the Great Divide is challenging and rewarding iff
• Building the bridge starts from both ends
• Parties recognize and respect each other’s core business
Open-source database technology provides a sound basis to manage
sizeable scientific databases
• To capitalize and steer expertise development
The database community can provide knowledge on modelling, query
processing, algorithms, data structures, scalability, persistency, …and
flexible database systems
The MonetDB team seeks new frontiers in scalable structured database
management