Service-Based Distributed Query Processing on the Grid
M. Nedim Alpdemir
Department of Computer Science
University of Manchester

Service-based approaches
[Figure: publish/find/bind model. The Service Provider publishes to the Service Registry; the Service Requester finds services in the registry and binds to the provider. Other labels: Credentials, roles; QoS negotiation tokens; Infrastructure; Monitoring, accounting, notification.]
- Service-based approaches facilitate virtualisation of resources.
- This leads to a convenient cooperation model for distributed systems (e.g. the Grid).

Context
[Figure: labelled Data Complexity and Computational Complexity.]

Web Services Are Not Enough
Web Services lack facilities for:
- Computational resource description.
- Computational resource discovery.
- Application staging.
Grid Services combine:
- Web Services for service description and invocation.
- Grid middleware for computational resource description and utilisation.

Open Grid Services Architecture (OGSA)
- OGSA services are described using WSDL.
- OGSA service instances are:
  - Created dynamically by factories.
  - Identified through Grid Service Handles.
  - Self-describing through Service Data Elements.
  - Stateful, with soft-state lifetime management.
- Current status:
  - Globus 3 beta release in June 2003: www.globus.org.
  - Supports service instances and access to other Globus services.
  - Core database services from the OGSA-DAI project are tracking Globus releases.

Grid Database Service (GDS)
- Databases are made available on the Grid through integration with other Grid services and provision of standard interfaces.
- GDSs build upon OGSA to deliver high-level data management functionality for the Grid.
- They seek to provide two classes of components:
  - Data access components
  - Data integration components

GDS interactions
[Figure: numbered interactions among the Client, Consumer, Registry (GDSR), Factory (GDSF), GDS instance and DB. Labelled operations: RegisterService, findServiceData, CreateService, perform(Query).]

Distributed Query Processing
- DQP involves a single query referencing data stored at multiple sites.
- The locations of the data may be transparent to the author of the query.
  select p.proteinId, Blast(p.sequence)
  from protein p, proteinTerm t
  where t.termId = 'GO:0005942'
    and p.proteinId = t.proteinId

J. Smith, A. Gounaris, P. Watson, N. Paton, A. Fernandes, R. Sakellariou, Distributed Query Processing on the Grid, 3rd Int. Workshop on Grid Computing, Springer-Verlag, 279-290, 2002.

Mutual Benefit
The Grid needs DQP:
- Declarative, high-level resource integration with implicit parallelism.
- DQP-based solutions should in principle run faster than manually coded ones.
DQP needs the Grid:
- Systematic access to remote data and computational resources.
- Dynamic resource discovery and allocation.

A Service-Based DQP Architecture
[Figure: layered architecture. Labels shown: Application/Presentation Layer, Distributed Query Processor, OGSA-DAI, OGSA, Accounting, Versioning, Security, Configuration, Logging, Auditing, Policy.]

Service-based DQP framework
- Service-based in two orthogonal senses:
  - It supports querying over data storage and analysis resources made available as services.
  - Construction of distributed query plans and their execution over the Grid are factored out as services.
- Uses the emerging standard for GDSs to provide consistent access to database metadata and to interact with databases on the Grid.
- A query may refer to database (GDS) and computational services.

Extends OGSA & OGSA-DAI …
By adding a new port type and two new services (and their corresponding factories):
Grid Distributed Query (GDQ) Port Type
- importSchema operation:
    GDQSR importSchema(GDQDataSourceList GDSL)
- GDSL: a document containing:
  - The list of data sources. The items on this list should contain the handles of the GDS factories, along with an instance creation document for each factory.
  - And/or a set of WSDL URLs for the analysis services to be used.

Continued …
Grid Distributed Query Service (GDQS)
- Wraps an existing query compiler/optimiser system that compiles, optimises, partitions and schedules distributed query execution plans.
- Obtains and maintains the metadata and computational resource information required for the above.
Grid Query Evaluator Service (GQES)
- Each GQES instance is an execution node and is dynamically created by the GDQS on the node on which it is scheduled to run.
- A GQES is in charge of the partition of the query execution plan assigned to it by the GDQS and is responsible for dispatching partial results to other GQESs.

Setting up a GDQS
- The set-up strategy depends on the lifetime model of the GDQS and GDSs.
- A GDQS instance is created per client, but it can serve multiple queries.
- This model avoids the complexity of multi-user interactions while ensuring that the set-up cost is not high.
- The set-up phase involves:
  - Importing schemas of participating data sources.
  - Importing WSDL documents of participating analysis services.
  - Collecting computational resource metadata (implicit).

Issues in Initialisation
Q: When is a GDQS bound to a particular GDS?
A: When the schema of the GDS is imported.
Q: What is the lifespan of a GDS used by a GDQS?
A: The GDS is kept alive until the GDQS expires.
Q: Are GDSs shared by multiple GDQSs?
A: No.
Q: When is a GQES created?
A: When a query is about to be evaluated that needs it.
Q: What is the lifespan of a GQES?
A: It lasts only as long as a single query.
Q: Is a GQES shared among several queries or GDQSs?
A: No.
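
The lifetime rules in the questions and answers above can be sketched as a small in-memory model. This is only an illustration of the stated rules, not the OGSA or OGSA-DAI API: the class names mirror the service names, but every method and field (import_schema, run_query, expire, alive) is a hypothetical stand-in.

```python
from dataclasses import dataclass, field


@dataclass
class GDS:
    """Stand-in for a Grid Data Service instance (hypothetical model)."""
    factory_handle: str
    alive: bool = True


@dataclass
class GQES:
    """Stand-in for a query evaluator; it serves exactly one query."""
    query_id: int
    alive: bool = True


@dataclass
class GDQS:
    """Per-client query service owning its GDSs and per-query GQESs."""
    client: str
    sources: dict = field(default_factory=dict)
    queries_run: int = 0
    alive: bool = True

    def import_schema(self, factory_handle: str) -> GDS:
        # Binding rule: the GDQS is bound to a GDS when its schema is imported.
        gds = GDS(factory_handle)
        self.sources[factory_handle] = gds
        return gds

    def run_query(self, partitions: int) -> list:
        # GQESs are created only when a query that needs them is evaluated.
        self.queries_run += 1
        evaluators = [GQES(self.queries_run) for _ in range(partitions)]
        # (Evaluation of the partitions would happen here.)
        for gqes in evaluators:
            gqes.alive = False  # A GQES lasts only as long as a single query.
        return evaluators

    def expire(self) -> None:
        # GDSs are kept alive until the owning GDQS expires.
        for gds in self.sources.values():
            gds.alive = False
        self.alive = False


gdqs = GDQS("client-1")
gds = gdqs.import_schema("GSH:GDSF")
first = gdqs.run_query(partitions=2)   # fresh evaluators for query 1
second = gdqs.run_query(partitions=3)  # new evaluators, none reused
gdqs.expire()
```

Because nothing is shared between GDQS instances in this model, no locking or multi-user bookkeeping is needed, which is the simplification the per-client set-up strategy aims for.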
Importing Schemas
[Figure: numbered schema-import interactions across nodes N1, N2 and N3, involving the Client, Registries (GDSR), the GDQS Factory (GDQSF), the GDS Factory (GDSF), GDQS1 and a GDS. Labelled operations: register, create, importSchema(GSH:GDSF, ConfDoc), findServiceData (ConfigDocs, DBSchema), Create(ConfDoc).]

An example of data source import list
  <GDQDataSourceList>
    <importedDataSource>
      <GDSFactoryHandle>
        http://130.88.198.203:8080/ogsa/services/ogsadai/GridDataServiceFactory
      </GDSFactoryHandle>
      <GDSCreateDocument>
        <gridDataServiceFactoryCreate>
          <dataResourceName>myDataResource</dataResourceName>
        </gridDataServiceFactoryCreate>
      </GDSCreateDocument>
    </importedDataSource>
    <importedService>
      <wsdlURL>
        http://www.ebi.ac.uk/collab/mygrid/service0/axis/services/urn:srs?WSDL
      </wsdlURL>
    </importedService>
  </GDQDataSourceList>

An example of a Query Document
  <request name="myRequest">
    <oqlQueryStatement name="myStat">
      <dataResource>myGenomeDB</dataResource>
      <expression>
        select p.proteinId, Blast(p.sequence)
        from proteins p, proteinTerms t
        where t.termId = 'GO:0005942'
          and p.proteinId = t.proteinId
      </expression>
    </oqlQueryStatement>
    <deliverToGDT name="delivery">
      <fromLocal name="myStat"/>
      <toGDT streamId="otherrequestasynch/d1" mode="full">
        http://ogsadai.org.uk/GDTService/my/GDT/GSH
      </toGDT>
    </deliverToGDT>
  </request>

Query Compilation
[Figure: compilation pipeline. OQL Parser, then the Single-node Optimiser (Logical Optimiser, Physical Optimiser), then the Multi-node Optimiser (Partitioner, Scheduler), then the Evaluator.]

Logical Optimisation
- Plan is expressed using a logical algebra.
- Heuristic-based application of equivalence laws.
- Multiple equivalent plans generated.

  reduce
    op_call (Blast)
      join (proteinId)
        reduce
          scan (protein)
        reduce
          scan termID=… (proteinTerm)

Physical Optimisation
- Plan is expressed using a physical algebra.
- Logical operators replaced with physical operators.
- Cost-based ranking of plans.
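
The two physical optimisation steps above (operator replacement, then cost-based ranking) can be sketched as follows. This is a toy illustration under stated assumptions, not the optimiser wrapped by the GDQS: hash_join, table_scan and index_scan come from the slides, while nested_loop_join and every cost figure are invented for the example. Only the operators that have physical alternatives are modelled; reduce and op_call are left out.

```python
from itertools import product

# Physical alternatives per logical operator. nested_loop_join is an added
# assumption; the other operator names appear on the slides.
ALTERNATIVES = {
    "join(proteinId)": ["hash_join", "nested_loop_join"],
    "scan(protein)": ["table_scan"],
    "scan(proteinTerm)": ["table_scan", "index_scan"],
}

# Made-up unit costs, for illustration only (not measured values).
COST = {
    ("join(proteinId)", "hash_join"): 10,
    ("join(proteinId)", "nested_loop_join"): 50,
    ("scan(protein)", "table_scan"): 20,
    ("scan(proteinTerm)", "table_scan"): 20,
    ("scan(proteinTerm)", "index_scan"): 5,
}


def physical_plans(logical_ops):
    """Yield every assignment of one physical operator per logical operator."""
    choices = [ALTERNATIVES[op] for op in logical_ops]
    for combo in product(*choices):
        yield dict(zip(logical_ops, combo))


def plan_cost(plan):
    """Sum the illustrative unit costs of the chosen physical operators."""
    return sum(COST[(logical, physical)] for logical, physical in plan.items())


def rank_plans(logical_ops):
    """Return all candidate physical plans, cheapest first."""
    return sorted(physical_plans(logical_ops), key=plan_cost)


logical_plan = ["join(proteinId)", "scan(protein)", "scan(proteinTerm)"]
best = rank_plans(logical_plan)[0]
print(best)
```

With these invented costs the cheapest candidate pairs hash_join with an index_scan on proteinTerm, which matches the shape of the physical plan on the slide; different cost figures would of course rank the candidates differently.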

  reduce
    op_call (Blast)
      hash_join (proteinId)
        reduce
          table_scan (protein)
        reduce
          index_scan termID=… (proteinTerm)

Partitioning
- Plan is expressed in a parallel algebra.
- Parallel algebra = physical algebra + exchange.
- Exchange operators are placed where data movement may be required.

  reduce
    op_call (Blast)
      exchange
        hash_join (proteinId)
          exchange
            reduce
              table_scan (protein)
          exchange
            reduce
              index_scan termID=… (proteinTerm)

Scheduling
- Partitions are allocated to Grid nodes; partitions may be merged during scheduling.
- Expressed by decorating the parallel algebra expression.
- Heuristic algorithm considers memory use and network costs.

  reduce                                      [nodes 3,4]
    op_call (Blast)
      exchange
        hash_join (proteinId)                 [node 1]
          exchange
            reduce                            [node 1]
              table_scan (protein)
          exchange
            reduce                            [node 2]
              table_scan termID=S92 (proteinTerm)

Query Evaluation
Query installation:
- GQESs created for partitions as required.
- Partitions sent to GQESs.
Query evaluation:
- Partitions evaluated using the iterator model.
- Pipelined and partitioned parallelism.
- Results conveyed to client.

An example of a query sub-plan passed to a GQES
  <Partition isRoot="0">
    <evaluatorURI>
      http://mach1.cs.man.ac.uk:8080/ogsa/services/ogsadai/GQESFactory/GQES1
    </evaluatorURI>
    ...
    <Operator operatorID="2" operatorType="SEQ_SCAN">
      <SEQ_SCAN>
        <tupleType>
          <type>string</type> <name>proteinTerms.GOproteinID</name>
          <type>string</type> <name>proteinTerms.term</name>
        </tupleType>
        <inputOperator>
          <OperatorID> </OperatorID>
        </inputOperator>
        <DataResourceName>proteinTermsDataResource</DataResourceName>
        <GDSHandle>
          http://mach1.cs.man.ac.uk:8080/…/GridDataServiceFactoryP2R1/GDS1
        </GDSHandle>
        <predicateExpr>
          <predicate>
            <comparativeOperator>EQ</comparativeOperator>
            <leftOperand name="proteinTerms.term" type="tuplefield"/>
            <rightOperand name="GO:0008372" type="string"/>
          </predicate>
        </predicateExpr>
      </SEQ_SCAN>
    </Operator>
    ...
  </Partition>

Interactions of SB-DQP components
[Figure: numbered interactions among the Client, Registry (GDSR), Factory (GDQSF), GDQS, GDS instances and GQES 1 … GQES n. Labelled operations: registerService, createService, findServiceData, importSchema, findServiceData(DBSchema), perform(query), perform(querySubPlan), perform(gqes_query), results.]
[Figure: example execution across nodes N0 to N4. The Client calls perform(Query) on the GDQS, which calls createService on GQES factories (GQESF) and sends perform(QuerySubplan) to each GQES. Operators shown: reduce (proteinID) over sequential_scan (term=8372); reduce (proteinID, sequence) over sequential_scan; hash_join (p.proteinID=t.proteinID); operation_call blast(p.sequence) against Web Services (BLAST) with reduce (p.proteinID, blast), shown on two evaluators; results are returned to the Client.]

Summary
DQP on the Grid provides:
- The normal benefits of DQP.
- Some added benefits from a Grid setting.
The Grid specifically enables:
- Runtime computational resource discovery.
- Dynamic creation of remote evaluators.
- Authentication/transport services.
- Access to non-database services.

Features of Our GDQS
- Low cost of entry:
  - Imports source descriptions through GDSs.
  - Imports service descriptions as WSDL.
- Throw-away GDQS:
  - Import sources on a task-specific basis.
  - Discard the GDQS when the task is completed.
- Builds on parallel database technology:
  - Implicit parallelism.
  - Pipelined + partitioned parallel evaluation.
- Public release in July 2003.

The SB-DQP Team
Manchester:
- Nedim Alpdemir
- Anastasios Gounaris
- Alvaro Fernandes
- Norman Paton
- Rizos Sakellariou
Newcastle:
- Arijit Mukherjee
- Jim Smith
- Paul Watson
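
As a closing sketch, the iterator model named on the Query Evaluation slide can be illustrated with Python generators, where each operator pulls rows from its input on demand. This is a single-process toy under stated assumptions, not the GQES implementation: exchange operators and real parallelism are omitted, reduce is treated as a projection (an assumption about the operator's meaning), fake_blast stands in for the BLAST service call, and the sample rows are invented.

```python
def table_scan(rows, predicate=lambda row: True):
    """Yield the rows of a table that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def reduce_(rows, fields):
    """Project each row onto the named fields (assumed meaning of reduce)."""
    for row in rows:
        yield {name: row[name] for name in fields}


def hash_join(build, probe, key):
    """Build a hash table on one input, then stream the other through it."""
    table = {}
    for row in build:
        table.setdefault(row[key], []).append(row)
    for row in probe:
        for match in table.get(row[key], []):
            yield {**match, **row}


def op_call(rows, func, arg, out):
    """Call func on one field of each row and attach the result as `out`."""
    for row in rows:
        yield {**row, out: func(row[arg])}


def fake_blast(sequence):
    """Hypothetical stand-in for the BLAST analysis service."""
    return len(sequence)


def run_example():
    # Invented sample data, for illustration only.
    proteins = [
        {"proteinId": "P1", "sequence": "MKT"},
        {"proteinId": "P2", "sequence": "GAVLI"},
    ]
    protein_terms = [
        {"proteinId": "P1", "termId": "GO:0005942"},
        {"proteinId": "P2", "termId": "GO:0008372"},
    ]
    terms = reduce_(
        table_scan(protein_terms, lambda t: t["termId"] == "GO:0005942"),
        ["proteinId"],
    )
    prots = reduce_(table_scan(proteins), ["proteinId", "sequence"])
    joined = hash_join(terms, prots, "proteinId")
    called = op_call(joined, fake_blast, "sequence", "blast")
    return list(reduce_(called, ["proteinId", "blast"]))


print(run_example())
```

Nothing is computed until the final list() pulls on the root operator, so rows flow through the plan one at a time. In the distributed setting described on the slides, an exchange operator would sit between partitions and move those rows between GQESs.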