Tore Risch
Department of Information Technology
Uppsala University, Sweden http://user.it.uu.se/~torer
Overload
[Figure: Global information created and available storage, in exabytes, 2005 to 2011 (including forecast). Source: IDC]
Need to process data directly in the streams
Data comes as large data streams, e.g.
- Satellite data
- Scientific instruments
- Patient monitoring
- Stock data
- Process industry
- Traffic control
Huge volumes of events representing measurements:
- particle type, mass, position
- time, position, temperature
- frequency spectrum at time t for receiver r
Data Base Management System (DBMS)
[Figure: DBMS architecture. SQL queries; DBMS with query processor and data manager; meta-data and stored data]
Data Stream Management System (DSMS)
[Figure: DSMS architecture. Data streams and continuous queries (CQs); DSMS with query processor and data & stream manager; meta-data and stored data]
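The contrast between the two architectures can be illustrated with a minimal Python sketch. It is not the DSMS described in these slides. It only shows the idea that a continuous query is registered once and then emits results as events arrive, rather than running once over stored data. The Reading type, the sensor() source, and the threshold are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class Reading(NamedTuple):
    """Hypothetical measurement event: time, position, temperature."""
    time: int
    position: int
    temperature: float


def continuous_query(stream: Iterable[Reading], threshold: float) -> Iterator[Reading]:
    """Emit each matching reading as soon as it arrives; nothing is stored."""
    for reading in stream:
        if reading.temperature > threshold:
            yield reading


def sensor() -> Iterator[Reading]:
    """Hypothetical finite source standing in for an unbounded stream."""
    temperatures = [18.5, 21.0, 25.5, 19.0, 30.25]
    for t, temp in enumerate(temperatures):
        yield Reading(time=t, position=t % 2, temperature=temp)


# The query is lazy: results are produced one by one while the stream is read.
hot = list(continuous_query(sensor(), threshold=24.0))
print([r.time for r in hot])
```

Because the query is a generator, it also works on a source that never ends; list() is used here only because the example source is finite.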
http://www.lois-space.net/
– Very high data volume and rate
– Complex numerical data
– Relatively expensive user-defined computations
– 13000 antennas
– Distributed over 100 stations
– Producing ~20 Tbps raw data
UU: Developing a scalable DSMS to process LOFAR stream queries
• Scalable for very high data volume
• Cannot save data -> Continuous queries (CQs)
• Queries over large moving windows of streams
• Filtering, reduction, combining of data streams
• User-defined CQ computations
• Scalable for expensive computations too.
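A query over a moving window can be sketched in Python as follows. This is a minimal illustration, not the system's query language. The function name sliding_mean and the window size are assumptions made for the example. The window keeps only the most recent values, so memory use stays bounded however long the stream runs.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def sliding_mean(stream: Iterable[float], size: int) -> Iterator[float]:
    """Yield the mean of the last `size` values once the window is full."""
    window: Deque[float] = deque(maxlen=size)
    total = 0.0
    for x in stream:
        if len(window) == size:
            total -= window[0]   # value about to be evicted by append()
        window.append(x)
        total += x
        if len(window) == size:
            yield total / size


# Reduction: five input values become three window averages.
means = list(sliding_mean([1.0, 2.0, 3.0, 4.0, 5.0], size=3))
print(means)
```

The running total avoids recomputing the sum for every window, which matters when windows are large.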
• A stream data manager for very high volume scientific streams
• Define stream computations over data windows through continuous queries
• User Defined Functions called in queries allow customized computations
• Distributed and dynamic architecture
• Data Flow Distribution Templates allow customized scalable distributed computation models
• Main memory stream query engine
• VLDB 2005
Tree-structured partitioning and combining
set wd-tree =
  PCC(2,"S-Distribute","RRpart",
      "PCC",(2,"S-Distribute",
             "RRpart","fft3",
             "S-Merge",0.1),
      "S-Merge",0.1);
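The nested partition, compute, combine structure of the template above can be sketched in Python. This is an illustration under assumptions, not the system's implementation: rr_part, s_merge, and pcc are hypothetical stand-ins for round-robin partitioning, stream merging, and the PCC template; expensive() stands in for fft3; the 0.1 parameter of S-Merge is not modelled; and the partitions run sequentially here, whereas the real system would place them on separate nodes.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Item = Tuple[int, float]                      # (sequence number, value)
Stage = Callable[[Sequence[Item]], List[Item]]


def rr_part(items: Sequence[Item], n: int) -> List[List[Item]]:
    """Round-robin partitioning: item k goes to partition k mod n."""
    return [list(items[i::n]) for i in range(n)]


def s_merge(parts: List[List[Item]]) -> List[Item]:
    """Merge partitions back into one stream ordered by sequence number."""
    return list(heapq.merge(*parts))


def pcc(n: int, compute: Stage) -> Stage:
    """Build a partition-compute-combine stage with n parallel branches."""
    def run(items: Sequence[Item]) -> List[Item]:
        return s_merge([compute(p) for p in rr_part(items, n)])
    return run


def expensive(items: Sequence[Item]) -> List[Item]:
    """Hypothetical stand-in for the expensive per-item computation."""
    return [(seq, value * value) for seq, value in items]


# Two-level tree: 2 branches, each split again into 2 branches.
wd_tree = pcc(2, pcc(2, expensive))
result = wd_tree([(i, float(i)) for i in range(8)])
print(result)
```

Because a PCC stage has the same type as the computation it wraps, stages nest to any depth, which is what makes the tree shape possible.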
• Scalable search and processing of very high-volume data produced by scientific instruments on massively parallel computers
• Runs on many platforms
– Linux clusters (IBM, Dell)
– IBM BlueGene
– Windows
• User defined parallelization through second order parallelization functions in queries
• Dynamic parallelization during query execution
• Application specific parallelization
• Scalable customizable data flow split strategies
• Presented at DASFAA 2010, EDBT 2008, DEPSA 2007
[Figure: hardware setup. Input streams, back-end cluster, BlueGene, front cluster, user]
www.cs.brandeis.edu/~linearroad
Domain specific parallelization functions
splitstream(s, integer 10, #'rfnD', #'bfnD');
create function rfnD(Event e, Integer w) -> Integer i
  as select i
  where (eventtype(e)<3 and i=0) or
        (eventtype(e)=3 and i=1);
create function bfnD(Event e, Integer w) -> Boolean
  as eventtype(e)=2;
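One way to read the functions above is sketched below in Python. The semantics are assumptions, not taken from the slides: the routing function is taken to return the index of the output stream for an event, the broadcast function is taken to mark events that are copied to every output, broadcasting is checked first, and events with no routing result are dropped. The second parameter is assumed to be the number of output streams. The example uses 2 outputs instead of 10 to keep the result readable.

```python
from typing import Callable, Dict, Iterable, List, Optional

Event = Dict[str, int]
RouteFn = Callable[[Event, int], Optional[int]]
BroadcastFn = Callable[[Event, int], bool]


def splitstream(stream: Iterable[Event], width: int,
                rfn: RouteFn, bfn: BroadcastFn) -> List[List[Event]]:
    """Split one stream into `width` outputs (assumed semantics, see above)."""
    outputs: List[List[Event]] = [[] for _ in range(width)]
    for event in stream:
        if bfn(event, width):            # broadcast: copy to every output
            for out in outputs:
                out.append(event)
            continue
        index = rfn(event, width)        # route: pick a single output
        if index is not None:
            outputs[index].append(event)
    return outputs


def rfn_d(event: Event, width: int) -> Optional[int]:
    """Mirror of rfnD: types below 3 go to output 0, type 3 to output 1."""
    if event["type"] < 3:
        return 0
    if event["type"] == 3:
        return 1
    return None                          # no routing result for other types


def bfn_d(event: Event, width: int) -> bool:
    """Mirror of bfnD: events of type 2 are broadcast."""
    return event["type"] == 2


events = [{"id": n, "type": t} for n, t in enumerate([0, 2, 3, 0, 3, 4])]
outputs = splitstream(events, 2, rfn_d, bfn_d)
print([[e["id"] for e in out] for out in outputs])
```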
Produces large streams of particle collision events
– Generate new particles
Problem: Find interesting particles produced by collisions
– E.g. Higgs bosons
Recorded in a large number of large binary files
– C++ framework ROOT
Scientists develop filters called cuts
– Matched against streams of collision events
– Collision event: complex object
[Figure from http://www.atlas.ch/etours_exper/]
POQSEC provides
– scientific query management
Grid provides
– computation management
– file management
NorduGrid Middleware
Application area provides
– computational libraries
– data management libraries
[Figure: POQSEC architecture. ROOT library, POQSEC, ROOT, NorduGrid, data clusters, ATLAS, domain ontology]
select e
from Event e, EventFile f
where name(experiment(f)) = "bkg2" and
      fileid(f) < 15 and
      e in events(filename(f)) and
      hadrtopcut(e) and jetvetocut(e) and misseecuts(e) and
      zvetocut(e) and threeleptoncut(e) and leptoncuts(e);
[Query parts labelled on the slide: selection, cuts]
Publications: SSDBM 2009, SSDBM 2008
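The idea of cuts as filters matched against a stream of collision events can be sketched in Python. This is a simplified illustration, not POQSEC or ROOT code. The Event fields, the three cut functions, and their thresholds are hypothetical; real cuts operate on far more complex event objects.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Sequence


@dataclass(frozen=True)
class Event:
    """Hypothetical, heavily simplified collision event."""
    event_id: int
    n_leptons: int
    n_jets: int
    missing_energy: float


Cut = Callable[[Event], bool]


def three_lepton_cut(e: Event) -> bool:
    return e.n_leptons == 3


def jet_veto_cut(e: Event) -> bool:
    return e.n_jets == 0                 # hypothetical veto condition


def missing_energy_cut(e: Event) -> bool:
    return e.missing_energy > 40.0       # hypothetical threshold


def select_events(events: Iterable[Event], cuts: Sequence[Cut]) -> Iterator[Event]:
    """Yield the events that pass every cut (a conjunction, as in the query)."""
    for e in events:
        if all(cut(e) for cut in cuts):
            yield e


events = [
    Event(1, n_leptons=3, n_jets=0, missing_energy=55.0),
    Event(2, n_leptons=2, n_jets=0, missing_energy=60.0),
    Event(3, n_leptons=3, n_jets=1, missing_energy=70.0),
    Event(4, n_leptons=3, n_jets=0, missing_energy=10.0),
]
cuts = [three_lepton_cut, jet_veto_cut, missing_energy_cut]
selected = [e.event_id for e in select_events(events, cuts)]
print(selected)
```

Since all() stops at the first failing cut, placing cheap and selective cuts first reduces the work per event.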
• Space physics
– Search, analyze, combine huge streams of data from space
– Look for transients
– Spatial analysis
– Computations over time windows
=> Scalable streamed data processing
• Particle physics
– Search and analyze huge volumes of complex objects (databases) describing particle collisions
– Execute filters identifying that certain particles were produced by a collision
– Look for the ultimate needle in the haystack, the Higgs boson
Scalable data & query processing over streams of complex objects
• Scientific data processing software, e.g.
– Database Management Systems
– Search engines
– Statistical processing software
– Numerical processing software
– Monitoring software
– Visualization software
– Documentation systems
• To run in distributed/parallel infrastructure
– PCs
– Clusters
– The Grid
– Clouds
– GPUs
• Need for scalable processing over large data streams:
– Scalable search
– Scalable computations
– Scalable statistical analyses
• Enabling technologies to develop further:
– Massive parallelism and distribution
– Very high level declarative specifications
– Stream query processing
– Streamed indexing, inferences, data mining
– (Inferred) meta-data specification
– Management of heterogeneous data
Thank you for your attention
Tore Risch http://user.it.uu.se/~torer