Challenges of streaming data Tore Risch Department of Information Technology Uppsala University, Sweden

Challenges of streaming data

Tore Risch

Department of Information Technology

Uppsala University, Sweden http://user.it.uu.se/~torer

Too much data to store on disk

[Figure "Overload": Global information created and available storage, in exabytes, 2005 to 2011 (partly forecast). Series: information created; available storage. Source: IDC]

Need to process data directly in the streams
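As a minimal sketch of what processing "directly in the stream" can mean, the following Python code keeps only a constant-size summary (count, mean, variance) of everything seen so far, so no element needs to be stored. The class and function names are illustrative assumptions, not part of any system described in these slides.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RunningStats:
    """Constant-size summary of all values seen so far."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Welford's one-pass update: each value is seen once, then discarded.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance of the values seen so far.
        return self.m2 / self.count if self.count else 0.0


def summarize(stream: Iterable[float]) -> RunningStats:
    """Consume a stream element by element without storing it."""
    stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats
```

Memory use is independent of the stream length, which is the property needed when more data is created than can be stored.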

New data intensive applications

Data comes as large data streams, e.g.

- Satellite data

- Scientific instruments

- Patient monitoring

- Stock data

- Process industry

- Traffic control

Huge volumes of events representing measurements:

- particle type, mass, position

- time, position, temperature

- frequency spectrum at time t for receiver r

Data Base Management System

[Diagram: SQL queries are sent to the DBMS (Query Processor, Data Manager), which works on meta-data and stored data.]

Data Stream Management System

[Diagram: Data streams and Continuous Queries (CQs) are fed into the DSMS (Query Processor, Data & Stream Manager), which also works on meta-data and stored data.]
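The difference between the two kinds of systems can be illustrated with a small Python sketch: a conventional query runs once over a finite stored collection, while a continuous query keeps running and emits results as stream elements arrive. The function names are hypothetical and only illustrate the idea.

```python
from itertools import count, islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def one_shot_query(stored: List[T], predicate: Callable[[T], bool]) -> List[T]:
    """DBMS style: evaluate once over stored data, return a finite result."""
    return [row for row in stored if predicate(row)]


def continuous_query(stream: Iterable[T],
                     predicate: Callable[[T], bool]) -> Iterator[T]:
    """DSMS style: evaluate for as long as the stream delivers data."""
    for item in stream:
        if predicate(item):
            yield item  # results are themselves a stream


# count() never ends, so only a continuous query can be applied to it;
# islice takes the first four results to make the example terminate.
first_results = list(islice(continuous_query(count(), lambda x: x % 3 == 0), 4))
```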

Data streams

http://www.lois-space.net/

– Very high data volume and rate

– Complex numerical data

– Relatively expensive user-defined computations

The LOFAR Instrument

- 13000 antennas

- Distributed over 100 stations

- Producing ~20 Tbps raw data

UU: Developing a scalable DSMS to process LOFAR stream queries
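To get a feeling for these numbers, the short Python calculation below converts the quoted raw rate into bytes per second and per day, and into average rates per station and per antenna. It assumes decimal units (1 Tbps = 10^12 bits per second) and an even spread of the data over stations and antennas; both are simplifying assumptions, not statements from the slides.

```python
RAW_RATE_BITS_PER_S = 20 * 10**12  # ~20 Tbps, decimal units assumed
ANTENNAS = 13_000
STATIONS = 100
SECONDS_PER_DAY = 86_400


def lofar_rates() -> dict:
    """Back-of-the-envelope rates derived from the figures on the slide."""
    bytes_per_second = RAW_RATE_BITS_PER_S / 8
    return {
        "bytes_per_second": bytes_per_second,
        "bytes_per_day": bytes_per_second * SECONDS_PER_DAY,
        # Averages only; an even spread over the instrument is an assumption.
        "bits_per_second_per_station": RAW_RATE_BITS_PER_S / STATIONS,
        "bits_per_second_per_antenna": RAW_RATE_BITS_PER_S / ANTENNAS,
    }


rates = lofar_rates()
```

Under these assumptions the raw stream amounts to 2.5 TB per second, or about 216 PB per day, which is why the data cannot simply be saved for later querying.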

LOIS/LOFAR requirements for data processing

• Scalable for very high data volume

• Cannot save data -> Continuous queries (CQs)

• Queries over large moving windows of streams

• Filtering, reduction, combining of data streams

• User-defined CQ computations

• Scalable for expensive computations too.
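A query over a moving window with a user-defined computation can be sketched in Python as follows. The count-based window, the parameter names `size` and `slide`, and the function names are assumptions made for illustration; they are not the syntax of the systems presented here.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def sliding_windows(stream: Iterable[T], size: int,
                    slide: int) -> Iterator[Tuple[T, ...]]:
    """Yield count-based windows of `size` items, advancing `slide` items."""
    window = deque(maxlen=size)  # old items fall out automatically
    since_emit = 0
    for item in stream:
        window.append(item)
        since_emit += 1
        if len(window) == size and since_emit >= slide:
            yield tuple(window)
            since_emit = 0


def windowed_query(stream: Iterable[T], size: int, slide: int,
                   fn: Callable[[Tuple[T, ...]], R]) -> Iterator[R]:
    """Apply a user-defined function to every window of the stream."""
    for window in sliding_windows(stream, size, slide):
        yield fn(window)


def mean(window: Tuple[float, ...]) -> float:
    return sum(window) / len(window)


means = list(windowed_query(range(10), size=4, slide=2, fn=mean))
```

Only the current window is held in memory; the user-defined function (here a simple mean) could be replaced by a more expensive numerical computation.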

Stream Database Manager GSDM

• A stream data manager for very high volume scientific streams

• Define stream computations over data windows through continuous queries

• User-defined functions called in queries allow customized computations

• Distributed and dynamic architecture

• Data Flow Distribution Templates allow customized scalable distributed computation models

• Main memory stream query engine

• Presented at VLDB 2005

Composed data flow templates

Tree-structured partitioning and combining:

set wd-tree =
  PCC(2,"S-Distribute","RRpart",
      "PCC",(2,"S-Distribute",
             "RRpart","fft3",
             "S-Merge",0.1),
      "S-Merge",0.1);
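The idea of composing partition-compute-combine templates into a tree can be sketched in Python. This is not GSDM code: streams are modeled as finite lists, `round_robin` and `interleave` are stand-ins assumed for round-robin partitioning and merging, `square_all` is a hypothetical placeholder for an expensive computation such as fft3, and the numeric template argument from the slide is not modeled.

```python
from typing import Callable, List

Stream = List[float]
Stage = Callable[[Stream], Stream]


def round_robin(stream: Stream, n: int) -> List[Stream]:
    """Partition: item i goes to partition i mod n."""
    return [stream[i::n] for i in range(n)]


def interleave(parts: List[Stream]) -> Stream:
    """Combine: inverse of round_robin for order-preserving computations."""
    merged: Stream = []
    for i in range(max(len(p) for p in parts)):
        for part in parts:
            if i < len(part):
                merged.append(part[i])
    return merged


def pcc(n: int, partition: Callable[[Stream, int], List[Stream]],
        compute: Stage, combine: Callable[[List[Stream]], Stream]) -> Stage:
    """Build a stage that partitions, computes per partition, and combines."""
    def stage(stream: Stream) -> Stream:
        parts = partition(stream, n)
        # In a distributed system each partition would run on its own node.
        results = [compute(part) for part in parts]
        return combine(results)
    return stage


def square_all(part: Stream) -> Stream:
    """Hypothetical placeholder for an expensive per-partition computation."""
    return [x * x for x in part]


# A two-level tree: the compute step of the outer template is itself a template.
inner = pcc(2, round_robin, square_all, interleave)
tree = pcc(2, round_robin, inner, interleave)
result = tree([float(x) for x in range(8)])
```

Because a template returns a stage of the same type as its compute argument, templates nest to any depth, which mirrors the tree-structured composition shown above.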

SCSQ - SuperComputer Stream Query processor

• Scalable search and processing of very high-volume data produced by scientific instruments on massively parallel computers

• Runs on many platforms

– Linux clusters (IBM, Dell)

– IBM BlueGene

– Windows

• User-defined parallelization through second-order parallelization functions in queries

• Dynamic parallelization during query execution

• Application-specific parallelization

• Scalable, customizable data flow split strategies

• Presented at DASFAA 2010, EDBT 2008, DEPSA 2007

Stream computations on several clusters:

[Diagram labels: Input streams, Back-end cluster, BlueGene, Front cluster, User]

SCSQ components:

www.cs.brandeis.edu/~linearroad

Linear road performance

Parallelizing expensive stream computations

SCSQ query parallelization

Domain specific parallelization functions

splitstream(s, integer 10, #'rfnD', #'bfnD');

create function rfnD(Event e, Integer w) -> Integer i
  as select i
  where (eventtype(e)<3 and i=0) or
        (eventtype(e)=3 and i=1);

create function bfnD(Event e, Integer w) -> Boolean
  as eventtype(e)=2;
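How such a split might behave can be sketched in Python. The semantics here are assumptions for illustration: the first function is read as a routing function returning the index of the output stream for an event, the second as a broadcast function marking events that go to every output stream, `w` as the number of output streams, and broadcasting as taking precedence over routing. Events with no routing result are dropped in this sketch. The Python names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Event:
    eventtype: int


def rfn_d(e: Event, w: int) -> Optional[int]:
    """Routing: types below 3 go to stream 0, type 3 goes to stream 1."""
    if e.eventtype < 3:
        return 0
    if e.eventtype == 3:
        return 1
    return None  # no routing result defined for other types


def bfn_d(e: Event, w: int) -> bool:
    """Broadcast: events of type 2 are sent to every output stream."""
    return e.eventtype == 2


def splitstream(stream: Iterable[Event], w: int,
                rfn: Callable[[Event, int], Optional[int]],
                bfn: Callable[[Event, int], bool]) -> List[List[Event]]:
    """Split one stream into w output streams (modeled as lists)."""
    outputs: List[List[Event]] = [[] for _ in range(w)]
    for e in stream:
        if bfn(e, w):  # assumption: broadcast is checked before routing
            for out in outputs:
                out.append(e)
        else:
            index = rfn(e, w)
            if index is not None:
                outputs[index].append(e)
    return outputs


events = [Event(1), Event(2), Event(3), Event(1)]
outputs = splitstream(events, 3, rfn_d, bfn_d)
```

Passing the routing and broadcast functions as arguments is what makes the split strategy customizable per application.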

ATLAS Application Data

Produces large streams of particle collision events

– Generate new particles

Problem: Find interesting particles produced by collisions

– E.g. Higgs bosons

Recorded in a large number of large binary files

– C++ framework ROOT

Scientists develop filters called cuts

– Matched against streams of collision events

– Collision event: complex object

(Image from http://www.atlas.ch/etours_exper/)

Software layers

POQSEC provides

– scientific query management

Grid provides

– computation management

– file management

NorduGrid Middleware

Application area provides

– computational libraries

– data management libraries

ROOT library

[Diagram labels: POQSEC, ROOT, NorduGrid, Data, Clusters]

Scientific DSMS SQISLE

ATLAS Domain Ontology

select e
from Event e, EventFile f
where name(experiment(f)) = "bkg2" and
      fileid(f) < 15 and
      e in events(filename(f)) and
      hadrtopcut(e) and jetvetocut(e) and
      misseecuts(e) and zvetocut(e) and
      threeleptoncut(e) and leptoncuts(e);

Publications: SSDBM2009, SSDBM2008

Selection

Cuts
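The way a conjunction of cuts selects events from a stream can be sketched in Python. The event attributes, the three cut functions, and their thresholds are hypothetical placeholders chosen for illustration; they are not the physics definitions of the cuts named in the query above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Sequence


@dataclass(frozen=True)
class CollisionEvent:
    event_id: int
    n_leptons: int
    n_jets: int
    missing_energy: float


Cut = Callable[[CollisionEvent], bool]


# Hypothetical cuts; real cuts are written by the scientists.
def three_lepton_cut(e: CollisionEvent) -> bool:
    return e.n_leptons == 3


def jet_veto_cut(e: CollisionEvent) -> bool:
    return e.n_jets == 0


def missing_energy_cut(e: CollisionEvent) -> bool:
    return e.missing_energy > 40.0  # illustrative threshold


def select_events(events: Iterable[CollisionEvent],
                  cuts: Sequence[Cut]) -> Iterator[CollisionEvent]:
    """Yield the events that pass every cut, one event at a time."""
    for e in events:
        # all() stops at the first failing cut, so cut order affects cost.
        if all(cut(e) for cut in cuts):
            yield e


events = [
    CollisionEvent(1, n_leptons=3, n_jets=0, missing_energy=55.0),
    CollisionEvent(2, n_leptons=2, n_jets=0, missing_energy=80.0),
    CollisionEvent(3, n_leptons=3, n_jets=1, missing_energy=60.0),
    CollisionEvent(4, n_leptons=3, n_jets=0, missing_energy=10.0),
]
cuts = [three_lepton_cut, jet_veto_cut, missing_energy_cut]
selected = [e.event_id for e in select_events(events, cuts)]
```

Since evaluation stops at the first failing cut, placing cheap and selective cuts first reduces the work per event, which is the kind of decision a query optimizer can make for a declarative query.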

Challenging database research

• Space physics

– Search, analyze, combine huge streams of data from space

– Look for transients

– Spatial analysis

– Computations over time windows

=> Scalable streamed data processing

• Particle physics

– Search and analyze huge volumes of complex objects (databases) describing particle collisions

– Execute filters that identify whether certain particles were produced by a collision

– Look for the ultimate needle in the haystack: the Higgs boson

=> Scalable data & query processing over streams of complex objects

Diverse Distributed Software

• Scientific data processing software, e.g.

– Database Management Systems

– Search engines

– Statistical processing software

– Numerical processing software

– Monitoring software

– Visualization software

– Documentation systems

• To run on distributed/parallel infrastructure

– PCs

– Clusters

– The Grid

– Clouds

– GPUs

Important research direction

• Need for scalable processing over large data streams:

– Scalable search

– Scalable computations

– Scalable statistical analyses

• Enabling technologies to develop further:

– Massive parallelism and distribution

– Very high level declarative specifications

– Stream query processing

– Streamed indexing, inferences, data mining

– (Inferred) meta-data specification

– Management of heterogeneous data

Thank you for your attention

Tore Risch http://user.it.uu.se/~torer