Research Questions in Streaming Relational Grid Information Service

Beth Plale
Computer Science Dept.
Indiana University

24 April 2003
Grid Information Services Workshop

Challenge:
• A grid information service by its nature maintains highly dynamic data.
• This creates competing client demands: freshness of data and fast query response times.
• Good solutions are emerging, but more research is needed to fully explore the space.

Definitions:
• Grid information service
  – Repository of information about resources on the grid
  – ‘Resource’ is very broadly defined
  – A resource is described by a set of attributes
  – Attributes can be static or dynamic
• Dynamic attributes
  – All attributes are dynamic to some degree
  – We consider an attribute dynamic if it could be refreshed once every several seconds or faster

How do we know fresh results and fast query response times are competing needs in grid information services?
• J. Schopf, HPDC 2003 paper
• W. Smith, HPDC 1999 paper
• Our work on synthetic database workload

Synthetic Database Workload
• Synthetic database benchmark consists of
  – 16 queries and updates
    • Carefully chosen to exercise broad aspects of platform
  – 4 scenarios
    • ‘Scenario’: controlled workload of queries and updates. Exposes query response time sensitivity to update workload.
• Started from ER diagram (or UML)
  – 34 tables/collections/object classes
  – Based on Glue v8
  – Extended with users, user accounts, logical connections

Synthetic Database Workload, cont.
• Data
  – 81,684 tuples/documents/objects per database
  – Approx. 10 MB (relational representation)
• Target platforms
  – MySQL v4.0, Xindice v1.1, MDS GT 2.2
• Evaluating against two metrics:
  – Query response time
  – Ease of use
    • Expressiveness of query language
    • Amount of information returned to user

Our observations add to an emerging body of knowledge: for all three platforms tested, query response time deteriorates noticeably under rapid update rates.

Approach to streamingRGR
• Assumption: a relational solution is a viable solution for a grid information service.
• Observation: the place to look for the most recent data is in data streams.
• Approach: query the database for static data, but pull dynamic data directly from the data stream.
• If a query is investigative (i.e., exploratory), then we must support querying the database for results that are ‘good enough’.

Skeleton architecture of approach
[Figure labels: client; SQL query; Query rewrite and results composition (very thin layer); ‘static’ portion of query; Relational database; ‘dynamic’ portion of query; Controlled updates; Stream query; Monitoring service (ganglia, NWS, Remos, R-GMA, vmstat); Distributed resources (hosts, clusters, switches, routers)]

Research questions in combining streaming/database solution
• Research question 1: differing models
  – Database:
    • Queries arrive at database,
    • Queries issued once, and
    • Issued against relatively stable set of tuples
  – Streams:
    • Events arrive at database,
    • Queries are long lived, and
    • Issued upon event arrival against sliding window of data.

Research question 2: handling of time
• Databases are not historical.
  – At any moment, the database contains the most recent values.
  – To look at it another way, an update to an attribute destroys the previous value.
• Streams are historical.
  If the first tuple satisfies the query, the second tuple likely will too:
  – <<Host1.Id, ts=1300><Host2.Id, ts=1303>>
  – <<Host1.Id, ts=1315><Host2.Id, ts=1318>>

Research question 3: implication of model and time differences
• Operators have different meanings depending on model:
  – Joins
  – Aggregate operators (min, avg, count)
• Queries have different semantics
• Accuracy of results becomes an issue
  – False positives far more likely in streaming environment

Purpose of our work is to address these research questions.

System Requirements:
• Stream use must be transparent to user
  – i.e., user writes queries to the database schema
• Use language extension to indicate whether client desires freshest possible response (and is thus willing to pay the price).
• Efficiency requirement demands that as much query processing as possible is done closest to the source.
  – dQUOB: for extracting data from data streams

• Approach to filtering, transforming, and aggregating streaming data using a declarative query language
• dQUOB: toolkit for creating queries and the computational entities that execute queries
• Beth Plale and Karsten Schwan, Dynamic Querying of Data Streams with dQUOB, IEEE Transactions on Parallel and Distributed Systems, April 2003.
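
The model and time differences raised in research questions 2 and 3 can be made concrete with a small sketch. This is not dQUOB code. The class names, the integer load values, and the 30-second window are all assumptions made for illustration; only the host names and timestamps come from the slides. It shows how the same aggregate (avg) returns different answers when an update destroys the previous value (database model) versus when events are retained in a sliding window (stream model).

```python
from collections import deque


class LatestValueTable:
    """Database model (hypothetical): an update overwrites the previous value."""

    def __init__(self):
        self.rows = {}

    def update(self, host_id, load, ts):
        # The previous (load, ts) for this host is destroyed.
        self.rows[host_id] = (load, ts)

    def avg_load(self):
        values = [load for load, _ in self.rows.values()]
        return sum(values) / len(values)


class SlidingWindowStream:
    """Stream model (hypothetical): events are kept while inside a time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()

    def append(self, host_id, load, ts):
        self.events.append((host_id, load, ts))
        # Evict events older than the window, measured from the newest event.
        while self.events and ts - self.events[0][2] > self.window_seconds:
            self.events.popleft()

    def avg_load(self):
        values = [load for _, load, _ in self.events]
        return sum(values) / len(values)


# Timestamps follow the slide; the load percentages are invented.
events = [
    ("Host1", 20, 1300),
    ("Host2", 40, 1303),
    ("Host1", 80, 1315),
    ("Host2", 60, 1318),
]

table = LatestValueTable()
stream = SlidingWindowStream(window_seconds=30)
for host_id, load, ts in events:
    table.update(host_id, load, ts)
    stream.append(host_id, load, ts)

table_avg = table.avg_load()    # averages only the latest value per host
stream_avg = stream.avg_load()  # averages every event still in the window
```

Here the table averages two rows while the stream averages four events, so a client asking for "average load" gets a different answer depending on which model evaluates the query.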

dQUOB Stream Processing
• SQL query language (subset)
  – Selection, projection
  – Conjunctive/disjunctive conditions
  – User defined functions
  – Joins: time-based
    • Two events satisfy join if they ‘happen at same time’
• Underlying communication layer
  – Publish-subscribe event communication system
  – Binary encoded events (using Georgia Tech’s PBIO)
• Mapping relational abstraction to pub/sub:
  – Tuple equivalent to event
  – Relation equivalent to channel (data stream)

dQUOB strengths for streaming grid information service
• Highly efficient evaluation of non-trivial queries
• Queries can be instantiated on-the-fly
• Because of relation->channel mapping, queries can be dynamically instantiated over relations (channels) not previously known to quoblet
• Queries can generate relation that becomes input relation to new query

Quoblet Architecture
[Figure labels: quoblet; QM; QN; dispatcher; input streams; output streams; provider; consumer; event stream]

Architecture of approach using dQUOB
[Figure labels: client; SQL query; mediator; ‘static’ query; Relational database; ‘dynamic’ query; Multiple queries within single quoblet; Monitoring service publishes events to event channel; Monitoring service; Distributed resources]

Integration Architecture
[Figure labels: client; SQL query; Decomposition and results formation; target query; Mediator *; Plan for Wrapper k; Capability-based rewrite; Description of types of queries wrapper can support; Component subquery; Wrapper i; database; Wrapper k; quoblets]
* Mediator architecture based on Garlic, IBM Almaden, 1995

Query execution: case where dynamic data requested
[Figure labels: client; Mediator; Subquery a over static attributes; Subquery b over dynamic attributes; Result relation; Resulting relation as a stream; Wrapper i; database; Wrapper k; quoblets]
Can accept any monitoring infrastructure that exports stream of events

Towards a Streaming Relational Grid Information Service
Beth Plale
http://www.cs.indiana.edu/~plale
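
As an illustration of the time-based join listed under dQUOB Stream Processing, the following is a minimal sketch under stated assumptions. It does not use the dQUOB API. `Event`, `time_join`, and the `tolerance` parameter are hypothetical names, and interpreting ‘happen at same time’ as "timestamps differ by at most `tolerance`" is an assumption, since the slides do not define it. Each list stands in for a channel (relation), and each element for an event (tuple).

```python
from typing import NamedTuple


class Event(NamedTuple):
    """One event (tuple) on a channel (relation); fields are illustrative."""
    host_id: str
    ts: int


def time_join(left, right, tolerance):
    """Pair events whose timestamps differ by at most `tolerance`.

    Assumed reading of 'happen at same time'; a simple nested-loop
    join, not an efficient streaming implementation.
    """
    return [
        (l, r)
        for l in left
        for r in right
        if abs(l.ts - r.ts) <= tolerance
    ]


# Timestamps taken from the research question 2 slide.
host1_channel = [Event("Host1", 1300), Event("Host1", 1315)]
host2_channel = [Event("Host2", 1303), Event("Host2", 1318)]

tight = time_join(host1_channel, host2_channel, tolerance=5)
loose = time_join(host1_channel, host2_channel, tolerance=20)
```

With a tolerance of 5 the join produces the two pairs shown on the slide (1300 with 1303, and 1315 with 1318). Widening the tolerance to 20 also pairs 1300 with 1318 and 1315 with 1303, which is one way the false positives mentioned under research question 3 can arise.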