The rasdaman Array DBMS: an Overview

eSI 2010
Edinburgh
Peter Baumann
Jacobs University Bremen
Baumann :: eSi
p.baumann@jacobs-university.de
Roadmap
 Introduction
 Conceptual modelling
 Architecture
 Applications
 Wrap-up
Raster Services: Differentiation
 multimedia databases
• Analyse images, then drop them
and work on auxiliary structure
 image processing
• Advanced processing of rasters,
but on main memory size objects
 image understanding, computer vision
• General recognition is probabilistic
• Databases deliver exact results whenever possible
[Figure labels: image processor = high-level analysis; raster database = selection, data reduction]
Conceptual Modelling
History
 Database view on raster images (e.g. [Meyer-Wegener 1989]):
• „raw image data consist of a matrix of pixels“,
but: „the raw data appear just as a string of bits“
• Focus on descriptive data („metadata“), neglecting images
Modeling Sensor, Image, & Statistics Data
 This presentation is based on [Baumann 1994]
• Formal semantics based on AFATL Image Algebra [Ritter et al 1990]
• Implementation: rasdaman
 rasdaman ("raster data manager") = middleware toolkit for high-volume n-D raster data
• SQL-embedded raster expressions; Java, C++
• Storage & query optimization
 rasql = retrieval & manipulation language
Array Algebra Overview
 array = function a: X → F (X an n-D integer interval)
a = { (x,a(x)): x ∈ X, a(x) ∈ F }
[Figure: an n-D array with labels for a cell, the (spatial) domain, and the dimensions]
 Core operations:
• Array constructor: build array & initialize from cell expression
• Condenser: summarize over array, delivering a scalar (using some commutative & associative summarization op)
• Sorter: slice array along a dimension, sort slices
 All else just shorthands: image addition, overlaying, statistics, ...
Array Operations: MARRAY
 Array constructor: MARRAY ( e|x, X, x ) := { (x,f): f = e|x, x∈ X }
• for an expression e|x of result type F, potentially containing occurrences of x
 Example: image addition, defined through addition of pixels
• a + b := MARRAY( a[x] + b[x], X, x ) = { (x,f): f = a[x] + b[x], x ∈ X }
 → shorthands: unary and binary "induced" operations
• "whenever I have a pixel operation, I automatically have the corresponding image operation"
• Image addition, comparison, component access, ...: a + b, a > b, a.green, ...
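The array constructor and the induced operations can be illustrated with a minimal Python sketch. This is not rasdaman code: arrays are modelled as dictionaries from coordinates to cell values, and the helper names marray, box and induced are assumptions made for this illustration only.

```python
from itertools import product

def box(*extents):
    """All integer coordinates of an n-D interval; one inclusive (lo, hi) pair per dimension."""
    return list(product(*(range(lo, hi + 1) for lo, hi in extents)))

def marray(domain, cell_expr):
    """MARRAY(e|x, X, x): build { (x, e(x)) : x in X } as a dict keyed by coordinate."""
    return {x: cell_expr(x) for x in domain}

def induced(op, a, b):
    """Lift a binary cell operation to two arrays that share one spatial domain."""
    assert a.keys() == b.keys(), "operands must share a spatial domain"
    return marray(a.keys(), lambda x: op(a[x], b[x]))

X = box((0, 1), (0, 2))                           # a 2 x 3 domain
a = marray(X, lambda x: x[0] + x[1])              # cell value = sum of its coordinates
b = marray(X, lambda x: 10)                       # constant array
total = induced(lambda p, q: p + q, a, b)         # a + b
brighter = induced(lambda p, q: p > q, total, b)  # (a + b) > b
```

Every induced operation here is just a call to marray with a different cell expression, which is the sense in which a pixel operation yields the corresponding image operation.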
Array Operations: COND
 Condenser: COND( e|a,x , o, X, x ) := e|a,p1 o e|a,p2 o ... o e|a,pn
• x visits each coordinate in X = { p1, ..., pn }
• e|a,pi expression potentially containing a and pi
• o commutative, associative
 Example: "Sum over all cell values"
• add(a) = COND( a[x], +, sdom(a), x )
= a[p1] + a[p2] + ... + a[pn]
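A condenser can be sketched in Python as a fold over the cell expression. The function name cond and the dictionary representation of arrays are assumptions of this sketch, not rasdaman internals.

```python
from functools import reduce
import operator

def cond(cell_expr, op, domain):
    """COND(e|a,x, o, X, x): combine e(x) for every x in X with the binary operation op."""
    return reduce(op, (cell_expr(x) for x in domain))

a = {(0, 0): 3, (0, 1): 5, (1, 0): 7, (1, 1): 11}
sdom = list(a.keys())  # the spatial domain of a

add_cells = cond(lambda x: a[x], operator.add, sdom)                    # sum over all cells
max_cells = cond(lambda x: a[x], max, sdom)                             # maximum cell value
count_gt4 = cond(lambda x: 1 if a[x] > 4 else 0, operator.add, sdom)    # count cells > 4
```

All three aggregates use a commutative and associative operation, so the visiting order of the coordinates does not affect the result.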
Why Commutative & Associative?
 Goal: declarative query language
• Declarative = express what you want, not how you get it
• Ex: select id from R where id < 10
...nothing about index usage, sequence,...
 Advantages:
• Database user doesn't have to care about details
• Optimiser gets liberty to (re-) organise query evaluation
 Example: tile-based processing: condensing tile by tile gives the same result (≡) as condensing the whole array
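Why the condenser operation has to be commutative and associative can be checked with a small Python sketch. The flat list of cells and the tile size of 4 are arbitrary choices for illustration.

```python
from functools import reduce
import operator

def condense(values, op):
    """Fold a sequence of values with a binary operation, left to right."""
    return reduce(op, values)

cells = list(range(1, 17))                                   # 16 integer cell values
tiles = [cells[i:i + 4] for i in range(0, len(cells), 4)]    # four tiles of four cells
tiles.reverse()                                              # tiles arrive in a different order

# Addition: tile-wise partial results combine to the same total.
whole_add = condense(cells, operator.add)
tiled_add = condense([condense(t, operator.add) for t in tiles], operator.add)

# Subtraction is neither commutative nor associative: the two plans disagree.
whole_sub = condense(cells, operator.sub)
tiled_sub = condense([condense(t, operator.sub) for t in tiles], operator.sub)
```

With addition the optimiser is free to pick any tile order; with subtraction the result would depend on the evaluation plan, which a declarative language cannot allow.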
From Algebra To Query Language
 Data model: (multi-) sets („collections“) of typed arrays
[Figure: collection my_coll holding arrays, each identified by an OID (oid 1 ... oid 5)]
 Data definition language rasdl [ODMG ODL]
• Parametrised array constructor
 Retrieval and manipulation language rasql [ISO SQL92]
• Set oriented, multidimensional operators
 Architecture streamlined towards piecewise processing of large objects
• Tile based
The rasql Query Language
 selection & section
– select c[ *:*, 100:200, *:*, 42 ]
  from ClimateSimulations as c
 result processing
– select img * (img.green > 130)
  from LandsatArchive as img
 search & aggregation
– select img
  from mri as img, masks as m
  where some_cells( img > 250 and m )
 data format conversion
– select png( c[ *:*, *:*, 100, 42 ] )
  from ClimateSimulations as c
[Figure: formats such as PNG, HDF and AVHRR on both the input and the output side of the rasdaman DB]
Application Example: Histogram
 Histogram of an n-D array over 8-bit unsigned integer:
• select marray n in [0:255] values count_cells( image = n )
  from image
 changes cell type, dimension, domain
• sdom( Histogram(image) ) = [0:255]
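The histogram query can be mirrored in Python to show how the result changes cell type, dimension and domain. The dictionary representation of the image and the helper names are assumptions of this sketch.

```python
def count_cells(image, predicate):
    """Number of cells whose value satisfies the predicate."""
    return sum(1 for value in image.values() if predicate(value))

def histogram(image, lo=0, hi=255):
    """marray n in [lo:hi] values count_cells( image = n )"""
    return {n: count_cells(image, lambda v, n=n: v == n) for n in range(lo, hi + 1)}

# A tiny 2-D image over 8-bit unsigned integers.
image = {(0, 0): 7, (0, 1): 7, (1, 0): 200, (1, 1): 0}
hist = histogram(image)
```

The input is 2-D with 8-bit cells, while the result is 1-D over [0:255] with integer counts. The sketch follows the declarative form literally and scans the image once per bin; an optimiser is free to evaluate it differently.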
Wrap-Up: Modelling
 Algebraic basis
• Formal semantics definition
• Small set of basic operations
• derived operations for „syntactic sugar“
 Type definition for raster objects
 Query language, rasql, extends SQL with raster expressions
• Algebra-based, again
• Safe, declarative, optimisable (later...)
Oracle 10g/11g
 GeoRaster
• 2D Geo raster imagery
• Response to ESRI's ArcSDE 8
 Functionality:
• Non-transparent image pyramids
• Subsetting, component extraction
• reprojection?
• No optimization clearly visible
PL/SQL (Oracle GeoRaster):
declare
  g sdo_georaster;
  b blob;
begin
  -- fetch the GeoRaster object
  select raster into g
  from uk_rasters
  where id = 4;
  -- temporary BLOB that receives the subset
  dbms_lob.createTemporary(b, true);
  sdo_geor.getRasterSubset(
    georaster    => g,
    pyramidlevel => 0,
    window       => sdo_number_array(0,0,699,899),
    bandnumbers  => '0',
    rasterBlob   => b);
end;
rasql:
select g.green[0:699,0:899]
from uk_rasters as g
where oid(g) = 4
Architecture I: Storage Mapping
Storage Mapping
 Task:
• materialise finite interval X ⊂ Z^n, find suitable (disk) access structure
• Core structural property: Euclidean neighbourhood in Z^n
• Secondary, contents/app based: data density („sparsity“), data pattern, access pattern
 Excursion: difference to arrays in main memory
• Ex: APL [Iverson 1968]
• Assumption 1: access times independent from array position
  cost( „a[x]“ ) = const for all „x“
• Assumption 2: access times independent from access sequence
  cost( „a[x];a[y]“ ) = 2*cost( „a[x]“ ) = const for all „x“, „y“
Storage Mapping: Variants
 BLOB (binary large object)
• Coordinate free sequence
• Costs mainly position/dimension dependent
[Figure: a 2-D subarray (X) inside an array (o) is scattered into separate runs once the array is linearised]
 Sequence independent, coordinates explicit
• { (x1,f1), (x2,f2), ..., (xn,fn) }
• Costs not position correlated, but high
 Imaging, multidimensional OLAP
• Partitioning, sequence within partition
• Costs low for bulk access, usually not location correlated
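The position and dimension dependence of BLOB access can be made concrete with a small Python sketch. It assumes a 2-D array linearised in row-major order and counts the contiguous runs a subarray occupies in the BLOB; the shape and the two query boxes are made up for illustration.

```python
def linear_offsets(shape, box):
    """Row-major BLOB offsets of all cells in a 2-D box.
    shape = (rows, cols); box = ((r0, r1), (c0, c1)), bounds inclusive."""
    rows, cols = shape
    (r0, r1), (c0, c1) = box
    return [r * cols + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]

def contiguous_runs(offsets):
    """Number of maximal runs of consecutive offsets (a rough proxy for seeks)."""
    runs = 1
    for prev, cur in zip(offsets, offsets[1:]):
        if cur != prev + 1:
            runs += 1
    return runs

shape = (4, 31)
column_block = ((0, 2), (23, 30))   # 3 rows x 8 columns at the right edge
row_block = ((1, 1), (0, 23))       # 24 cells inside a single row

runs_column_block = contiguous_runs(linear_offsets(shape, column_block))
runs_row_block = contiguous_runs(linear_offsets(shape, row_block))
```

Both boxes contain 24 cells, yet the block spanning several rows is split into one run per row, while the block inside one row is a single run.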
Tiled Array Storage
[Furtado 2000, Widmann 2001]
 multidimensional object → set of multidimensional tiles
• Tile = subarray, kept as a BLOB in the DBMS and located through an index
• Also called: mosaicking [imaging, geo], chunking [Sarawagi, DeWitt]
 ...enables tile streaming
• optimal evaluation sequence?
 Other systems:
• PostGIS Raster: do it yourself
• SciDB: redundant tile overlap
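How a tiled array answers a subset query can be sketched in Python: the index lookup reduces to finding the tiles that intersect the query box. The sketch assumes regular tiles aligned to the origin; the function name tiles_hit and the example shapes are assumptions, not the rasdaman index implementation.

```python
from itertools import product

def tiles_hit(query, tile_shape):
    """Origins of the regular tiles that intersect an n-D query box.
    query: one inclusive (lo, hi) pair per dimension; tile_shape: tile edge length per dimension."""
    index_ranges = [range(lo // t, hi // t + 1) for (lo, hi), t in zip(query, tile_shape)]
    return [tuple(i * t for i, t in zip(idx, tile_shape)) for idx in product(*index_ranges)]

# A cut at z = 100 through a 256 x 256 x n volume, under two hypothetical tilings.
z_cut = ((0, 255), (0, 255), (100, 100))
sliced = tiles_hit(z_cut, (256, 256, 1))   # one tile per z slice
cubed = tiles_hit(z_cut, (32, 32, 32))     # cubic tiles
```

For this cut the sliced tiling touches a single tile, while the cubic tiling touches 64 tiles of 32 x 32 x 32 cells each, that is 32 times as many cells as the query returns. A different access pattern would favour the cubic tiling, which is why tiling should follow the expected workload.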
Benchmarks: Tiling Strategy
Operand: 3-D MDD object
Operation: Z cut, selectivity: 1.6 %
Tilings compared: tomo_sliced 153x256, tomo_cubed 32x32x32
[Figure: measured time per tiling; values not preserved]
Comparison: BLOB Read Performance
 Optimal tuning per system
 OS competitors often better!
[Figure: two charts of read time [msec] over BLOB size, one for 1k to 3.9k and one for 5k to 10m. Series: MySQL ARCHIVE; PostgreSQL, b=8k, l=2k; SystemA, CHUNK=8k and CHUNK=32k; SystemB, p=16k]
Comparison: Time to Read (Deduced)
 performance varies by two orders of magnitude!
• @100K / MySQL vs @10K / SystemB
[Figure: time [ms] on a logarithmic axis from 1 to 10000000 over BLOB size from 1k to 10m. Series: MySQL; PostgreSQL; SystemA, CHUNK=8K; SystemB, p=16K]
Tiling Strategies
 Goal: faster tile loading by adapting storage units to access pattern
• When is tiling optimal?
 Tiling strategies [Furtado 1999]
• regular
• directional
• area of interest
[Figure: example tilings; residual labels L0, L1]
 Storage layout language [Baumann, Fezyabadi, Jucovschi 2010]
insert into MyCollection values ...
tiling area of interest [0:20,0:40],[45:80,80:85]
tile size 1000000
index d_index storage array compression zlib
Adding Tertiary Storage
 tape archives for near-line access
[Sarawagi, Stonebraker 1994]
 Problem: respect spatial clustering
• Access locality (long positioning times!)
 Approach: super tiles = all tiles of
particular index node [Reiner 2001]
• Natural unit, comfortable to handle
Architecture II: Query Processing
Architecture
[Figure: architecture diagram with the components raslib / rasj; Client Communication Layer; Server Communication Layer; QL Parser; Optimizer; Executor; Index; Cache & TA; Catalog; Base DBMS Interface; conventional base DBMS. Further labels: rasql, raster data, alphanumeric data]
Query Processing: Overview
 Parsing
 Normalisation
 Optimization
• Common subexpression elimination
 [Generate query plan]
 Tile-based evaluation
Example:
select a < avg_cells( b + c )
from a, b, c
[Figure: operator tree <ind( a, avg( +ind( b, c ) ) )]
Benchmarks: Data Access
[Ritsch 2000, Widmann 2001]
[Figure: stacked shares (0 to 100%) of ttransport, tcpu, tio, tindex and topt over #cells [1000] per MDD, from 0 to 2000]
Benchmarks: Data Processing
[Ritsch 2000, Widmann 2001]
[Figure: bar chart, axis from 0,00 to 50,00 (unit not preserved), series "NoIter, NoOps" and "Iter, Ops", for Query 1 to Query 6]
Query 1: access to 2-D object
Query 2: + 1 induced operation
Query 3: + 2 induced operations
Query 4: + 3 induced operations
Query 5: + 4 induced operations
Query 6: + 5 induced operations
Optimisation Does Pay Off!
 Complex queries give more space to optimizer
 Typical OGC Web Map Service query:
select jpeg(
          scale(bild0[...],[1:300,1:300]) * {1c, 1c, 1c}
  overlay ((scale(bild1[...],[1:300,1:300])<71.0)) * {51c, 153c, 255c}
  overlay bit(scale(bild2[...],[1:300,1:300]), 2) * {230c, 230c, 204c}
  overlay bit(scale(bild2[...],[1:300,1:300]), 5) * {1c, 1c, 1c}
  overlay bit(scale(bild2[...],[1:300,1:300]), 7) * {102c, 102c, 102c}
  overlay bit(scale(bild2[...],[1:300,1:300]), 6) * {255c, 255c, 0c}
  overlay bit(scale(bild2[...],[1:300,1:300]), 3) * {191c, 242c, 128c}
  overlay bit(scale(bild2[...],[1:300,1:300]), 4) * {191c, 255c, 255c}
  overlay bit(scale(bild2[...],[1:300,1:300]), 1) * {0c, 255c, 255c}
  overlay bit(scale(bild2[...],[1:300,1:300]), 0) * {102c, 102c, 102c}
)
from ...
Query Parallelisation
 easy: inter-query parallelization
(one client – one server process)
• Long-runners don't block service
• higher throughput
 Non-trivial: intra-query parallelization
(one client – several server processes)
[Hahn 2003]
• Idea: tiles dynamically assigned to processors
• Non-trivial array index patterns?
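The idea of assigning tiles dynamically to processors can be sketched in Python for a simple condenser. Threads stand in for the server processes here, and the function names are assumptions of this sketch; it says nothing about how [Hahn 2003] implements the scheme.

```python
from concurrent.futures import ThreadPoolExecutor

def tile_sum(tile):
    """Partial condenser result for one tile."""
    return sum(tile)

def parallel_add_cells(tiles, workers=4):
    """Sum all cells; each tile goes to whichever worker becomes free next."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(tile_sum, tiles))
    return sum(partials)  # combine the partial results

# Ten tiles of 100 cells each, holding the values 0 .. 999.
tiles = [list(range(start, start + 100)) for start in range(0, 1000, 100)]
result = parallel_add_cells(tiles)
```

This works because addition is commutative and associative, so partial results may be combined in any order. Operations whose index patterns reach across tile borders need more care, which is the open question raised on the slide.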
Optimization 2: Just-In-Time Compilation
 Approach:
• cluster suitable operations
• compile & link at runtime
 Benefit:
• Speed up complex, frequent queries
 Benchmark query [Jucovschi 2008]:
select x * x * ... * x
from float_dataset as x
[Figure: eval times [ms] for 512² * n ops; one series labelled NO-OPT]
Query Optimization
select avg_cells( a + b )
from a, b
[Operator tree: avg over +ind( a, b ); a tile stream flows between the operators: high traffic]
≡
select avg_cells( a ) + avg_cells( b )
from a, b
[Operator tree: + over avg( a ) and avg( b ); a scalar stream flows between the operators: low traffic]
 understood: heuristic optimization
 partially understood: cost-based optimization
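The equivalence of the two query plans can be verified with a short Python sketch. Exact rational arithmetic is used so that the comparison is not blurred by floating-point rounding; the arrays and helper names are made up for illustration, and both arrays are assumed to share one spatial domain.

```python
from fractions import Fraction

def avg_cells(arr):
    """Exact average over all cells of an array."""
    return Fraction(sum(arr.values()), len(arr))

def induced_add(a, b):
    """Cell-wise addition of two arrays over the same domain."""
    return {x: a[x] + b[x] for x in a}

a = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 6}
b = {(0, 0): 10, (0, 1): 20, (1, 0): 30, (1, 1): 41}

# Plan 1: add the arrays first; a full array flows into the aggregation.
plan_tile_stream = avg_cells(induced_add(a, b))
# Plan 2: aggregate each array first; only two scalars are combined.
plan_scalar_stream = avg_cells(a) + avg_cells(b)
```

Both plans yield the same value, but the second never materialises the intermediate array a + b. With floating-point cells the two plans can differ in the last bits, which an optimiser has to tolerate or account for.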
Raster Database Applications
Sample WCS Based 3-D Service
DLR-DFD: eoweb.dlr.de [Diedrich et al 2001]
based on rasdaman
WCPS
 Request yields one or more n-D coverages
 Abstract syntax (requests shipped as XML):
for var in ( coverageList )
[ where condition(var) ]
return processingExpr(var)
 Example:
for m in ( ModisA, ModisB, ModisC )
where max( m.red ) > 127
return encode( m.red + m.nir, "tiff" )
• Result: ( tiff_A, tiff_C )
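The for / where / return semantics can be mimicked in a few lines of Python. This is only a sketch of the evaluation order, not a WCPS implementation: the coverage contents are invented, bands are flat lists, and encode is a placeholder that merely tags the cells with a format name.

```python
def wcps(coverages, condition, processing):
    """for var in ( coverages ) [ where condition(var) ] return processing(var)"""
    return [processing(c) for c in coverages if condition(c)]

def encode(cells, fmt):
    """Placeholder for real encoding: pair the format name with the cells."""
    return (fmt, cells)

# Hypothetical coverages with a red and a near-infrared band each.
modis = {
    "ModisA": {"red": [10, 200], "nir": [1, 2]},
    "ModisB": {"red": [5, 100], "nir": [3, 4]},
    "ModisC": {"red": [128, 0], "nir": [5, 6]},
}

result = wcps(
    [modis[name] for name in ("ModisA", "ModisB", "ModisC")],
    lambda m: max(m["red"]) > 127,
    lambda m: encode([r + n for r, n in zip(m["red"], m["nir"])], "tiff"),
)
```

With these invented values ModisB fails the where clause, so the result has two entries, matching the shape of the result shown on the slide.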
Climate Modelling
DKRZ: 24-node NEC SX-6
 Example: ECHAM T42 (cf. video)
• 50+ physical parameters („variables“): temperature, wind speed x/y, humidity, pressure, CO2, ...
• 2.5 TB per variable
dimension                       extent
Longitude                       128
Latitude                        64
Elevation                       17
time (24 min per time slice)    2,190,000 (200 years)
 observation: Huge volumes moved, only part needed (10:1)
• [Kleese 2000]
Cosmological Simulation
 Modelling domain: 4D
• Dark matter (highest mass factor in universe)
• Baryonic matter (stars, gas, dust, …)
 Coupled simulation: particle + fluid
 Results: 3D/4D cutouts from universe
• E.g., 64 Mpc³ (megaparsec; 1 pc ≈ 3.26 light years)
 Screenshots: AstroMD [Gheller, Rossi 2001]
Cosmology (contd.)
 Guided retrieval:
• Selection of objects and their attributes (cell components)
• interactive setting of trim operations per dimension
• Augmented with induced operations
 Suitable for expert users
 Details: cosmolab.cineca.it
Human Brain Imaging
 Research goal: to understand structural-functional relations in human brain
 Experiments capture activity patterns (PET, fMRI)
• Temperature, electrical, oxygen consumption, ...
• → lots of computations → „activation maps“
 Example: „a parasagittal view of all scans containing critical Hippocampus activations, TIFF-coded.“
select tiff( ht[ $1, *:*, *:* ] )
from HeadTomograms as ht, Hippocampus as mask
where count_cells( ht > $2 and mask ) / count_cells( mask ) > $3
$1 = slicing position, $2 = intensity threshold value, $3 = confidence
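The where clause of this query can be read as "the share of mask cells above the intensity threshold exceeds the confidence". A minimal Python sketch of that predicate follows; the scans and the mask are flat lists with invented values, and the mask is assumed to contain at least one set cell.

```python
def count_cells(cells):
    """Number of cells that are true."""
    return sum(1 for c in cells if c)

def activated(ht, mask, threshold, confidence):
    """count_cells( ht > $2 and mask ) / count_cells( mask ) > $3"""
    hits = count_cells(v > threshold and m for v, m in zip(ht, mask))
    return hits / count_cells(mask) > confidence

# Hypothetical scans, flattened to four cells each, and a region mask.
scans = {"scan1": [300, 280, 10, 0], "scan2": [300, 10, 10, 0]}
mask = [True, True, True, False]

selected = [name for name, ht in scans.items() if activated(ht, mask, 250, 0.5)]
```

In scan1 two of the three mask cells exceed the threshold, so it passes a confidence of 0.5; in scan2 only one does, so it is filtered out before any slice is extracted or encoded.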
Gene Expression Analysis
http://urchin.spbcas.ru/Mooshka/ [Samsonova et al]
 Gene expression = reading out genes for reproduction
 Research goal: capture spatio-temporal expression patterns in Drosophila
• Example: slices 0, 1 and 2 of an embryo image mapped to the red, green and blue channels (genes → colours), scaled to 20 %
select jpeg( scale( {1c,0c,0c}*e[0,*:*,*:*]
                  + {0c,1c,0c}*e[1,*:*,*:*]
                  + {0c,0c,1c}*e[2,*:*,*:*], 0.2 ) )
from EmbryoImages as e
where oid(e)=193537
Summary: Domains Investigated
 Geo
• Environmental sensor data, 1-D
• Satellite / seafloor maps, 2-D
• Geophysics (3-D x/y/z)
• Climate modelling (4-D, x/y/z/t)
 Life science
• Gene expression simulation (3-D)
• Human brain imaging (3-D / 4-D)
 Other
• Computational Fluid Dynamics (3-D)
• Astrophysics (4-D)
Wrap-Up
The Big Picture
 Large-scale raster services important + growing field
• Currently driven by geo services
• Largely neglected challenge to databases
• largest single DB objects ever!
 Can translate most features from alphanumeric databases (and benefit):
• Declarative, optimizable query language
• formal semantics definition
• Suitable storage architecture
 Service providers & users demand it
• "2D, 3D imagery next great challenge in geo databases" [Xavier Lopez, Oracle]
 Many open issues, such as:
• optimization
• standardized benchmarks
• what expressive power?
• Web service & data integration
Raster Services in the Semantic Web
 Machine/machine communication requires
machine-readable semantics definition
• Formal semantics
 Current raster services targeted at humans
• "portrayal" = surfing images, navigating maps
 Array Algebra / WCPS a basis for service reasoning
• Service discovery
• Service chaining / orchestration
• Distributed request processing (cloud computing!)
Vision: Document Integrated Retrieval
2-D imagery
3-D volumes
2-D tables
1-D time series
„all clinical trials of drug X where patient temperature > 40 °C within the first 48 hours.“