Research Questions in
Streaming Relational Grid
Information Service
Beth Plale
Computer Science Dept.
Indiana University
24 April 2003
Grid Information Services Workshop
Challenge:
• Grid information service by its nature
maintains highly dynamic data.
• This creates competing client demands:
fresh data and fast query response
times.
• Good solutions emerging but more
research needed to fully explore space.
Definitions:
• Grid information service – repository of
information about resources on grid
– ‘Resource’ very broadly defined
– Resource described by set of attributes.
– Attributes can be static or dynamic
• Dynamic attributes
– All attributes are dynamic to some degree
– We consider an attribute to be dynamic if it
can be refreshed once every several seconds
or faster
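The working definition above can be sketched as a simple classification rule. This is an illustration only: the 5-second threshold is an assumed stand-in for "several seconds", and the attribute names and refresh intervals are hypothetical.

```python
from dataclasses import dataclass

# Assumption: "several seconds" is taken to be 5 s purely for illustration.
DYNAMIC_THRESHOLD_SECONDS = 5.0


@dataclass(frozen=True)
class Attribute:
    name: str
    # Shortest interval (seconds) at which the value could be refreshed.
    refresh_interval_seconds: float


def is_dynamic(attr: Attribute, threshold: float = DYNAMIC_THRESHOLD_SECONDS) -> bool:
    """An attribute is dynamic if it can be refreshed at least once per threshold."""
    return attr.refresh_interval_seconds <= threshold


# Hypothetical attributes of a host resource.
attrs = [Attribute("cpu_load", 1.0), Attribute("os_version", 86400.0)]
labels = {a.name: ("dynamic" if is_dynamic(a) else "static") for a in attrs}
```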
How do we know fresh results
and fast query response times
are competing needs in grid
information services?
• J. Schopf HPDC 2003 paper
• W. Smith, HPDC 1999 paper
• Our work on synthetic database
workload
Synthetic Database Workload
• Synthetic database benchmark - consists of
– 16 queries and updates
• Carefully chosen to exercise broad aspects of platform
– 4 scenarios
• ‘Scenario’ - controlled workload of queries
and updates. Exposes query response time
sensitivity to update workload.
• Started from ER diagram (or UML)
– 34 tables/collections/object classes
– Based on Glue v8
– Extended with users, user accounts, logical
connections
Synthetic Database Workload, cont.
• Data
– 81,684 tuples/documents/objects per database
– Approx. 10MB (relational representation)
• Target platforms
– MySQL v4.0, Xindice v1.1, MDS GT 2.2
• Evaluating against two metrics:
– Query response time
– Ease of use
• Expressiveness of query language
• Amount of information returned to user
Our observations add to the
emerging body of evidence
that, for all three platforms
tested, query response time
deteriorates noticeably under
rapid update rates
Approach to streamingRGR
• Assumption: a relational solution is viable
for a grid information service.
• Observation: the place to look for the most
recent data is in the data streams.
• Approach: query the database for static data, but
pull dynamic data directly from the data stream.
• If a query is investigative (i.e. exploratory), then
we must support querying the database for results
that are ‘good enough’.
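A minimal sketch of the split described above, assuming the set of dynamic attributes is known in advance. The attribute names, the `want_fresh` flag, and the function `split_attributes` are hypothetical and not part of the system presented here.

```python
# Hypothetical set of attributes treated as dynamic.
DYNAMIC_ATTRIBUTES = {"cpu_load", "free_memory"}


def split_attributes(requested, want_fresh):
    """Partition requested attributes into (static_part, dynamic_part).

    If the client does not ask for the freshest data (an investigative
    query), everything is answered from the database as 'good enough'.
    """
    if not want_fresh:
        return list(requested), []
    static_part = [a for a in requested if a not in DYNAMIC_ATTRIBUTES]
    dynamic_part = [a for a in requested if a in DYNAMIC_ATTRIBUTES]
    return static_part, dynamic_part


requested = ["host_name", "cpu_load", "os_version", "free_memory"]
fresh_split = split_attributes(requested, want_fresh=True)
exploratory_split = split_attributes(requested, want_fresh=False)
```

With `want_fresh=True` the dynamic attributes would be pulled from the stream; otherwise the whole query goes to the database.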
Skeleton architecture of approach
[Diagram labels, top to bottom]
• client: SQL query
• Query rewrite and results composition (very thin layer)
– ‘static’ portion of query: Relational database (controlled updates)
– ‘dynamic’ portion of query: stream query
• Monitoring service (ganglia, NWS, Remos, R-GMA, vmstat)
• Distributed resources (hosts, clusters, switches, routers)
Research questions in combining
streaming and database solutions
• Research question 1: Differing models:
– Database:
• Queries arrive at database,
• Queries issued once, and
• Issued against relatively stable set of tuples
– Streams:
• Events arrive at database,
• Queries are long lived, and
• Issued upon event arrival against sliding window of data.
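The contrast between the two models can be sketched as follows. This is a toy illustration, not dQUOB code: the class `ContinuousQuery`, the window size, and the sample loads are all assumptions.

```python
from collections import deque

# Database model: the query is issued once against a stable set of tuples.
table = [{"host": "h1", "load": 0.2}, {"host": "h2", "load": 0.9}]
one_shot = [row["host"] for row in table if row["load"] > 0.5]


class ContinuousQuery:
    """Stream model: a long-lived query re-evaluated on every event arrival."""

    def __init__(self, predicate, window_size):
        self.predicate = predicate
        # Sliding window: the oldest event drops out when the window is full.
        self.window = deque(maxlen=window_size)

    def on_event(self, event):
        self.window.append(event)
        return [e for e in self.window if self.predicate(e)]


cq = ContinuousQuery(lambda e: e["load"] > 0.5, window_size=2)
events = [
    {"host": "h1", "load": 0.9},
    {"host": "h1", "load": 0.3},
    {"host": "h1", "load": 0.7},
]
results = [cq.on_event(e) for e in events]
```

The one-shot query returns a single answer; the continuous query returns a new answer per event, and the first event stops matching only once it slides out of the window.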
Research question 2: handling of
time
• Databases are not historical.
– At any moment, database contains most recent
values.
– To look at another way, an update to an attribute
destroys the previous value.
• Streams are historical. If first tuple satisfies
query, second tuple likely will too:
– <<Host1.Id, ts=1300><Host2.Id, ts=1303>>
– <<Host1.Id, ts=1315><Host2.Id, ts=1318>>
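The difference in how time is handled can be sketched with a dictionary standing in for a database table and a list standing in for a stream. The timestamps follow the example above; the load values and the helper `observe` are made up for illustration.

```python
db = {}       # database stand-in: one current value per host
stream = []   # stream stand-in: every observation is retained in order


def observe(host, ts, load):
    # Database update: overwrites, so the previous value is destroyed.
    db[host] = {"ts": ts, "load": load}
    # Stream: appends, so earlier observations remain visible.
    stream.append({"host": host, "ts": ts, "load": load})


observe("Host1", 1300, 0.4)
observe("Host1", 1315, 0.6)

history = [e["ts"] for e in stream if e["host"] == "Host1"]
```

After both observations the database only knows the value at `ts=1315`, while the stream still holds both.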
Research question 3: implication of
model and time differences
• Operators have different meanings
depending on model:
– Joins
– Aggregate operators (min, avg, count)
• Queries have different semantics
• Accuracy of results becomes an issue
– False positives far more likely in streaming
environment
Purpose of our work is to address these
research questions.
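A small sketch of why aggregates and selections can mean different things in the two models. The data is invented, and the "stale match" shown is only one assumed way a false positive could arise in a streaming setting.

```python
from statistics import mean

# Database model: one current tuple per host (made-up load percentages).
table = {"h1": 20, "h2": 80}

# Stream model: a window holds several readings, possibly many per host.
# Readings are listed oldest first; h1's latest reading is 20.
window = [("h1", 90), ("h1", 70), ("h1", 20), ("h2", 80)]

db_avg = mean(list(table.values()))
stream_avg = mean([load for _, load in window])

db_count = len(table)
stream_count = len(window)

# Selection "load > 85": the window still contains h1's old reading of 90,
# so h1 matches in the stream even though its current load is 20.
db_hot = [h for h, load in table.items() if load > 85]
stream_hot = sorted({h for h, load in window if load > 85})
```

Here `avg` and `count` differ between the table and the window because the window counts readings rather than hosts.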
System Requirements:
• Stream use must be transparent to user
– i.e., user writes queries to the database schema
• Use a language extension to indicate whether the
client desires the freshest possible response (and
is thus willing to pay the price).
• Efficiency demands that as much query
processing as possible be done close to the
source.
– dQUOB – for extracting data from data streams
• Approach to filtering, transforming, and
aggregating of streaming data using
declarative query language
• dQUOB - toolkit for creating queries, and
computational entities that execute queries
• Beth Plale and Karsten Schwan, Dynamic Querying
of Data Streams with dQUOB, IEEE Transactions on
Parallel and Distributed Systems, April 2003.
dQUOB Stream Processing
• SQL query language (subset)
– Selection, projection
– Conjunctive/disjunctive conditions
– User defined functions
– Joins: time-based
• Two events satisfy join if they ‘happen at same
time’
• Underlying communication layer
– Publish-subscribe event communication system
– Binary encoded events (using Georgia Tech’s
PBIO)
• Mapping relational abstraction to pub/sub:
– Tuple equivalent to event
– Relation equivalent to channel (data stream)
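One possible reading of a time-based join, sketched under the assumption that two events 'happen at same time' when their timestamps differ by at most a fixed tolerance. The tolerance of 5 and the function `time_join` are illustrative assumptions, not dQUOB's actual definition.

```python
def time_join(left, right, tolerance):
    """Pair events from two channels whose timestamps are within `tolerance`."""
    return [
        (l, r)
        for l in left
        for r in right
        if abs(l["ts"] - r["ts"]) <= tolerance
    ]


# Two channels (relations), each a sequence of events (tuples).
host1 = [{"id": "Host1", "ts": 1300}, {"id": "Host1", "ts": 1315}]
host2 = [{"id": "Host2", "ts": 1303}, {"id": "Host2", "ts": 1318}]

pairs = [(l["ts"], r["ts"]) for l, r in time_join(host1, host2, tolerance=5)]
```

With these timestamps the join yields two result tuples, one for each pair of near-simultaneous events.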
dQUOB strengths for streaming grid
information service
• Highly efficient evaluation of non-trivial
queries
• Queries can be instantiated on-the-fly
• Because of relation->channel mapping,
queries can be dynamically instantiated over
relations (channels) not previously known to
quoblet
• Queries can generate relation that becomes
input relation to new query
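The idea that one query's output relation can feed another query can be sketched with a toy in-process publish-subscribe bus. The class `Channels`, the function `install_query`, and the channel names are hypothetical and far simpler than the event system dQUOB runs on.

```python
from collections import defaultdict


class Channels:
    """Toy pub/sub bus: each channel plays the role of a relation."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, event):
        for callback in list(self.subscribers[channel]):
            callback(event)


def install_query(bus, source, sink, predicate):
    """Instantiate a selection query on the fly: source channel -> sink channel."""

    def on_event(event):
        if predicate(event):
            bus.publish(sink, event)

    bus.subscribe(source, on_event)


bus = Channels()
alerts = []

# First query produces a new relation, "busy_hosts" ...
install_query(bus, "host_load", "busy_hosts", lambda e: e["load"] > 80)
# ... which becomes the input relation of a second query added later.
install_query(bus, "busy_hosts", "busy_cluster_a", lambda e: e["cluster"] == "a")
bus.subscribe("busy_cluster_a", alerts.append)

for event in [
    {"host": "h1", "cluster": "a", "load": 95},
    {"host": "h2", "cluster": "b", "load": 90},
    {"host": "h3", "cluster": "a", "load": 10},
]:
    bus.publish("host_load", event)
```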
Quoblet Architecture
[Diagram labels: providers; input streams; quoblet; dispatcher; QM; QN; output streams; consumers]
Architecture of approach using dQUOB
[Diagram labels, top to bottom]
• client: SQL query
• mediator
– ‘static’ query: Relational database
– ‘dynamic’ query: multiple queries within single quoblet
• event stream: monitoring service publishes events to event channel
• Monitoring service
• Distributed resources
Integration Architecture
[Diagram labels, top to bottom]
• client: SQL query
• Mediator *
– Decomposition and results formation
– target query
– Capability-based rewrite
– Plan for Wrapper k
– Description of types of queries wrapper can support
• Component subquery
• Wrapper i: database
• Wrapper k: quoblets
* Mediator architecture based on Garlic, IBM Almaden, 1995
Query execution: case where dynamic data requested
[Diagram labels, top to bottom]
• client
• Mediator
– Subquery a over static attributes: Wrapper i, database
– Subquery b over dynamic attributes: Wrapper k, quoblets
• Result relation
• Resulting relation as a stream
Can accept any monitoring infrastructure
that exports stream of events
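A minimal sketch of how a mediator might form the result relation from the two subquery answers, assuming both share a host key and that stream events arrive in timestamp order. The function `compose` and all field names are hypothetical.

```python
def compose(static_rows, dynamic_events, key="host"):
    """Merge static rows with the most recent dynamic event per key.

    Assumes dynamic_events is ordered by arrival, so later events
    overwrite earlier ones for the same key.
    """
    latest = {}
    for event in dynamic_events:
        latest[event[key]] = event
    # Keep only hosts for which fresh dynamic data has arrived.
    return [{**row, **latest[row[key]]} for row in static_rows if row[key] in latest]


# Answer to the static subquery (from the database wrapper).
static_rows = [{"host": "h1", "os": "linux"}, {"host": "h2", "os": "linux"}]
# Answer to the dynamic subquery (from the quoblet wrapper).
dynamic_events = [
    {"host": "h1", "ts": 1300, "load": 40},
    {"host": "h1", "ts": 1315, "load": 60},
]

result = compose(static_rows, dynamic_events)
```

In a streaming setting the result would itself be re-emitted as new events arrive; this sketch shows a single composition step only.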
Towards a Streaming Relational Grid
Information Service
Beth Plale
http://www.cs.indiana.edu/~plale