Service-Based Distributed Query Processing on the Grid
M. Nedim Alpdemir
Department of Computer Science
University of Manchester

Service-based approaches
[Diagram: a Service Requester, a Service Provider and a Service Registry connected by Publish, Find and Bind. Supporting infrastructure: credentials and roles; QoS negotiation tokens; monitoring, accounting and notification.]
Service-based approaches facilitate virtualisation of resources, which leads to a convenient cooperation model for distributed systems (e.g. the Grid).

Context
[Figure: axes labelled Data Complexity and Computational Complexity.]

Web Services Are Not Enough
- Web Services lack facilities for:
  - Computational resource description.
  - Computational resource discovery.
  - Application staging.
- Grid Services combine:
  - Web Services for service description and invocation.
  - Grid middleware for computational resource description and utilisation.

Open Grid Services Architecture (OGSA)
- OGSA services are described using WSDL.
- OGSA service instances are:
  - Created dynamically by factories.
  - Identified through Grid Service Handles.
  - Self-describing through Service Data Elements.
  - Stateful, with soft-state lifetime management.
- Current status:
  - Globus 3 beta release in June 2003: www.globus.org.
  - Supports service instances and access to other Globus services.
  - Core database services from the OGSA-DAI project are tracking Globus releases.

Grid Database Service (GDS)
- Databases are made available on the Grid through integration with other Grid services and provision of standard interfaces.
- Build upon OGSA to deliver high-level data management functionality for the Grid.
- Seek to provide two classes of components:
  - Data access components
  - Data integration components

GDS interactions
[Diagram: a Client, a GDS Registry (GDSR), a GDS Factory (GDSF), a GDS instance backed by a DB, and a Consumer, connected by interactions numbered 1 to 5. Labelled operations: RegisterService, findServiceData, CreateService and perform(Query) (step 4).]

Distributed Query Processing
- DQP involves a single query referencing data stored at multiple sites.
- The locations of the data may be transparent to the author of the query.

    select p.proteinId, Blast(p.sequence)
    from protein p, proteinTerm t
    where t.termId = 'GO:0005942' and
          p.proteinId = t.proteinId

J. Smith, A. Gounaris, P. Watson, N. Paton, A. Fernandes, R. Sakellariou, Distributed Query Processing on the Grid, 3rd Int. Workshop on Grid Computing, Springer-Verlag, 279-290, 2002.

Mutual Benefit
- The Grid needs DQP:
  - Declarative, high-level resource integration with implicit parallelism.
  - DQP-based solutions should in principle run faster than those manually coded.
- DQP needs the Grid:
  - Systematic access to remote data and computational resources.
  - Dynamic resource discovery and allocation.

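A Service-Based DQP Architecture
[Diagram: layered architecture with the Application/Presentation Layer at the top, then the Distributed Query Processor, OGSA-DAI and OGSA. Concerns shown alongside the layers: Accounting, Versioning, Security, Configuration, Logging, Auditing and Policy.]
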
Service-based DQP framework
- Service-based in two orthogonal senses:
  - It supports querying over data storage and analysis resources made available as services.
  - Construction of distributed query plans and their execution over the Grid are factored out as services.
- Uses the emerging standard for GDSs to provide consistent access to database metadata and to interact with databases on the Grid.
- A query may refer to database (GDS) and computational services.

Extends OGSA & OGSA-DAI …
- By adding a new port type and two new services (and their corresponding factories):
- Grid Distributed Query (GDQ) Port Type
  - importSchema operation:
      GDQSR importSchema(GDQDataSourceList GDSL)
  - GDSL: a document containing:
    - The list of data sources. Each item on this list should contain the handle of a GDS Factory, along with an instance creation document for that factory.
    - And/or a set of WSDL URLs for the analysis services to be used.

Extends OGSA & OGSA-DAI (continued)
- Grid Distributed Query Service (GDQS)
  - Wraps an existing query compiler/optimiser system, which compiles, optimises, partitions and schedules distributed query execution plans.
  - Obtains and maintains the metadata and computational resource information required for the above.
- Grid Query Evaluator Service (GQES)
  - Each GQES instance is an execution node and is dynamically created by the GDQS on the node on which it is scheduled to run.
  - A GQES is in charge of the partition of the query execution plan assigned to it by the GDQS, and is responsible for dispatching partial results to other GQESs.

Setting up a GDQS
- The set-up strategy depends on the lifetime model of the GDQS and GDSs:
  - A GDQS instance is created per client,
  - but it can serve multiple queries.
  - This model avoids the complexity of multi-user interactions while ensuring that the set-up cost is not high.
- The set-up phase involves:
  - Importing schemas of participating data sources.
  - Importing WSDL documents of participating analysis services.
  - Collecting computational resource metadata (implicit).

Issues in Initialisation
- Q: When is a GDQS bound to a particular GDS?
  A: When the schema of the GDS is imported.
- Q: What is the lifespan of a GDS used by a GDQS?
  A: The GDS is kept alive until the GDQS expires.
- Q: Are GDSs shared by multiple GDQSs?
  A: No.
- Q: When is a GQES created?
  A: When a query that needs it is about to be evaluated.
- Q: What is the lifespan of a GQES?
  A: It lasts only as long as a single query.
- Q: Is a GQES shared among several queries or GDQSs?
  A: No.

Importing Schemas
[Diagram: schema import across three nodes (N1 to N3), with interactions numbered 1 to 8. Components: Client, GDS Registry (GDSR), GDQS Factory (GSH:GDQSF), GDQS instance GDQS1 (GSH:GDQS1), GDS Factory (GSH:GDSF) and GDS instance (GSH:GDS1). Labelled operations: register, findServiceData, create, importSchema(GSH:GDSF, ConfDoc), findServiceData (ConfigDocs), Create(ConfDoc) and findServiceData (DBSchema).]

An example of a data source import list
<GDQDataSourceList>
  <importedDataSource>
    <GDSFactoryHandle>
      http://130.88.198.203:8080/ogsa/services/ogsadai/GridDataServiceFactory
    </GDSFactoryHandle>
    <GDSCreateDocument>
      <gridDataServiceFactoryCreate>
        <dataResourceName>
          myDataResource
        </dataResourceName>
      </gridDataServiceFactoryCreate>
    </GDSCreateDocument>
  </importedDataSource>
  <importedService>
    <wsdlURL>
      http://www.ebi.ac.uk/collab/mygrid/service0/axis/services/urn:srs?WSDL
    </wsdlURL>
  </importedService>
</GDQDataSourceList>

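As a minimal sketch of how a client might assemble such an import list, the following Python builds the same structure with the standard library's xml.etree.ElementTree. Only the element names come from the example above; the helper name build_data_source_list and its argument layout are hypothetical and are not part of any OGSA-DAI client API.

```python
import xml.etree.ElementTree as ET


def build_data_source_list(data_sources, wsdl_urls):
    """Build a GDQDataSourceList document (hypothetical helper).

    data_sources: iterable of (GDS factory handle, data resource name) pairs.
    wsdl_urls: iterable of WSDL URLs of analysis services.
    """
    root = ET.Element("GDQDataSourceList")
    for handle, resource_name in data_sources:
        source = ET.SubElement(root, "importedDataSource")
        ET.SubElement(source, "GDSFactoryHandle").text = handle
        create_doc = ET.SubElement(source, "GDSCreateDocument")
        create = ET.SubElement(create_doc, "gridDataServiceFactoryCreate")
        ET.SubElement(create, "dataResourceName").text = resource_name
    for url in wsdl_urls:
        service = ET.SubElement(root, "importedService")
        ET.SubElement(service, "wsdlURL").text = url
    return root


doc = build_data_source_list(
    [("http://130.88.198.203:8080/ogsa/services/ogsadai/GridDataServiceFactory",
      "myDataResource")],
    ["http://www.ebi.ac.uk/collab/mygrid/service0/axis/services/urn:srs?WSDL"],
)
xml_text = ET.tostring(doc, encoding="unicode")
```

The resulting xml_text is what would be passed as the GDSL argument of importSchema; how that call is actually issued depends on the client toolkit and is not shown here.
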
An example of a Query Document
<request name="myRequest">
  <oqlQueryStatement name="myStat">
    <dataResource name="myGenomeDB"/>
    <expression>
      select p.proteinId, Blast(p.sequence)
      from proteins p, proteinTerms t
      where t.termId = 'GO:0005942' and p.proteinId = t.proteinId
    </expression>
  </oqlQueryStatement>
  <deliverToGDT name="delivery">
    <fromLocal name="myStat"/>
    <toGDT streamId="otherrequestasynch/d1" mode="full">
      http://ogsadai.org.uk/GDTService/my/GDT/GSH
    </toGDT>
  </deliverToGDT>
</request>

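Query Compilation
[Diagram: compilation pipeline]
OQL -> Parser -> Logical Optimiser -> Physical Optimiser -> Partitioner -> Scheduler -> Evaluator
- Single-node optimiser: Logical Optimiser, Physical Optimiser
- Multi-node optimiser: Partitioner, Scheduler
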
Logical Optimisation
- Plan is expressed using a logical algebra.
- Heuristic-based application of equivalence laws.
- Multiple equivalent plans generated.

Example logical plan:
    reduce
      op_call (Blast)
        join (proteinId)
          reduce
            scan (protein)
          reduce
            scan termID=… (proteinTerm)

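One equivalence law that heuristic optimisers commonly apply is pushing a selection below a join when its predicate refers to only one input. The sketch below illustrates that single rewrite; the tuple encoding of plan nodes and the function names are assumptions made for this example, not the algebra or rule set of the optimiser wrapped by the GDQS.

```python
# Plan nodes are plain tuples: (operator, argument, *children).
# This encoding and push_selection are illustrative assumptions only.

def tables(plan):
    """Return the set of table names scanned anywhere inside a plan."""
    op, arg, *children = plan
    if op == "scan":
        return {arg}
    found = set()
    for child in children:
        found |= tables(child)
    return found


def push_selection(plan):
    """Rewrite select(p, join(l, r)) as join(select(p, l), r), or the
    mirror image, when p refers to a table scanned on one side only."""
    op, arg, *children = plan
    children = [push_selection(c) for c in children]
    if op == "select" and children[0][0] == "join":
        table, _predicate = arg      # arg = (table referenced, predicate text)
        _, key, left, right = children[0]
        if table in tables(left):
            return ("join", key, push_selection(("select", arg, left)), right)
        if table in tables(right):
            return ("join", key, left, push_selection(("select", arg, right)))
    return (op, arg, *children)


before = ("select", ("proteinTerm", "termId = 'GO:0005942'"),
          ("join", "proteinId",
           ("scan", "protein"),
           ("scan", "proteinTerm")))
after = push_selection(before)
```

After the rewrite the selection sits directly above the proteinTerm scan, so fewer tuples reach the join.
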
Physical Optimisation
- Plan is expressed using a physical algebra.
- Logical operators replaced with physical operators.
- Cost-based ranking of plans.

Example physical plan:
    reduce
      op_call (Blast)
        hash_join (proteinId)
          reduce
            table_scan (protein)
          reduce
            index_scan termID=… (proteinTerm)

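To make cost-based ranking concrete, here is a deliberately tiny sketch that ranks two candidate physical join operators by an estimated cost. Every formula and cardinality in it is an assumption chosen for illustration; it is not the cost model used by the optimiser described in these slides.

```python
# Toy cost model: formulas and row counts are illustrative assumptions.

def join_cost(algorithm, left_rows, right_rows):
    """Estimated number of tuple comparisons or probes for a join."""
    if algorithm == "nested_loop_join":
        return left_rows * right_rows      # compare every pair of tuples
    if algorithm == "hash_join":
        return left_rows + right_rows      # build once, probe once
    raise ValueError(f"unknown join algorithm: {algorithm}")


def rank_plans(left_rows, right_rows):
    """Return the candidate physical joins ordered from cheapest upwards."""
    candidates = ["nested_loop_join", "hash_join"]
    return sorted(candidates,
                  key=lambda alg: join_cost(alg, left_rows, right_rows))


ranking = rank_plans(left_rows=10_000, right_rows=200)
best = ranking[0]
```

With these made-up cardinalities the hash join is ranked first, which matches the operator chosen in the example plan.
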
Partitioning
- Plan is expressed in a parallel algebra.
- Parallel algebra = physical algebra + exchange.
- Exchange operators are placed where data movement may be required.

Example partitioned plan:
    reduce
      op_call (Blast)
        exchange
          hash_join (proteinId)
            exchange
              reduce
                table_scan (protein)
            exchange
              reduce
                index_scan termID=… (proteinTerm)

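Because exchange operators mark the points where data may move between nodes, they also delimit the partitions of the plan. The following sketch cuts a plan at each exchange; the nested-tuple encoding, the exchange_from placeholder and the function name are assumptions for illustration rather than the partitioner's real data structures.

```python
# Plans are nested tuples (operator, *children); the encoding and the
# function name split_at_exchanges are illustrative assumptions.

def split_at_exchanges(plan):
    """Cut a parallel plan at each exchange operator.

    Returns a list of partitions, root partition first. Inside a partition,
    a cut-off subtree is replaced by ("exchange_from", i), where i is the
    index of the partition that produces that input.
    """
    partitions = [None]                  # slot 0 is reserved for the root

    def cut(node):
        op, *children = node
        if op == "exchange":
            index = len(partitions)
            partitions.append(None)      # reserve the slot before recursing
            partitions[index] = cut(children[0])
            return ("exchange_from", index)
        return (op, *[cut(child) for child in children])

    partitions[0] = cut(plan)
    return partitions


plan = ("reduce",
        ("op_call(Blast)",
         ("exchange",
          ("hash_join(proteinId)",
           ("exchange", ("reduce", ("table_scan(protein)",))),
           ("exchange", ("reduce", ("index_scan(proteinTerm)",)))))))
partitions = split_at_exchanges(plan)
```

For the example plan this yields four partitions: the Blast call, the join, and one per scan.
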
Scheduling
- Partitions are allocated to Grid nodes; partitions may be merged during scheduling.
- Expressed by decorating the parallel algebra expression.
- Heuristic algorithm considers memory use and network costs.

Example scheduled plan (numbers are the Grid nodes assigned to each partition):
    reduce                                [nodes 3,4]
      op_call (Blast)
        exchange
          hash_join (proteinId)           [node 1]
            exchange
              reduce                      [node 1]
                table_scan (protein)
            exchange
              reduce                      [node 2]
                table_scan termID=S92 (proteinTerm)

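A scheduling heuristic of this general kind can be sketched as a greedy placement: keep a scan on the node that holds its data to avoid network transfer, and otherwise choose the node with the most free memory. The rules, node names and memory figures below are assumptions for illustration; the slides only state that the real algorithm considers memory use and network costs.

```python
# A greedy placement sketch; rules and numbers are illustrative assumptions.

def schedule(partitions, free_memory):
    """Map each partition name to a node.

    partitions: list of dicts with keys "name", "memory" (MB needed) and,
        optionally, "data_node" (the node holding the data a scan reads).
    free_memory: dict mapping node -> MB currently available (not modified).
    """
    remaining = dict(free_memory)
    placement = {}
    for part in partitions:
        node = part.get("data_node")
        if node is None or remaining[node] < part["memory"]:
            # No data locality to exploit, or no room on the data node:
            # fall back to the node with the most free memory.
            node = max(remaining, key=remaining.get)
        if remaining[node] < part["memory"]:
            raise RuntimeError(f"no node can host {part['name']}")
        remaining[node] -= part["memory"]
        placement[part["name"]] = node
    return placement


placement = schedule(
    [
        {"name": "scan_protein", "memory": 64, "data_node": "N1"},
        {"name": "scan_proteinTerm", "memory": 64, "data_node": "N2"},
        {"name": "hash_join", "memory": 256},
        {"name": "op_call_blast", "memory": 128},
    ],
    free_memory={"N1": 512, "N2": 128, "N3": 384},
)
```

In this run the join lands on the same node as one of the scans, which is one way the merging of partitions during scheduling can arise.
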
Query Evaluation
- Query installation:
  - GQESs created for partitions as required.
  - Partitions sent to GQESs.
- Query evaluation:
  - Partitions evaluated using iterator model.
  - Pipelined and partitioned parallelism.
  - Results conveyed to client.

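The iterator model can be illustrated with Python generators, where each operator pulls tuples from its child on demand, giving pipelined evaluation within a partition. This is a single-process sketch only: the sample rows and the fake_blast function are made up, and the exchange of tuples between GQESs is not modelled.

```python
# Generators stand in for the open/next/close iterator interface.
# Sample rows and fake_blast are invented for illustration.

def table_scan(rows):
    """Leaf operator: yield stored tuples one at a time."""
    yield from rows


def select(predicate, child):
    """Pass through only the tuples that satisfy the predicate."""
    return (row for row in child if predicate(row))


def hash_join(key, build, probe):
    """Build a hash table on one input, then stream the other through it."""
    table = {}
    for row in build:
        table.setdefault(row[key], []).append(row)
    for row in probe:
        for match in table.get(row[key], []):
            yield {**match, **row}


def op_call(function, argument, result_name, child):
    """Call an external operation once per tuple and attach its result."""
    for row in child:
        yield {**row, result_name: function(row[argument])}


def fake_blast(sequence):
    # Placeholder for a remote analysis service call.
    return f"hits-for-{sequence}"


proteins = [{"proteinId": 1, "sequence": "MKV"},
            {"proteinId": 2, "sequence": "GAT"}]
terms = [{"proteinId": 2, "termId": "GO:0005942"},
         {"proteinId": 1, "termId": "GO:0000001"}]

plan = op_call(
    fake_blast, "sequence", "blast",
    hash_join("proteinId",
              select(lambda t: t["termId"] == "GO:0005942", table_scan(terms)),
              table_scan(proteins)),
)
results = [(row["proteinId"], row["blast"]) for row in plan]
```

No tuple is produced until the final list comprehension starts pulling from the root operator, which is the defining property of the iterator model.
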
An example of a query sub-plan passed to a GQES
<Partition isRoot="0">
  <evaluatorURI>
    http://mach1.cs.man.ac.uk:8080/ogsa/services/ogsadai/GQESFactory/GQES1
  </evaluatorURI>
  ...
  <Operator operatorID="2" operatorType="SEQ_SCAN">
    <SEQ_SCAN>
      <tupleType>
        <type> string </type>
        <name> proteinTerms.GOproteinID </name>
        <type> string </type>
        <name> proteinTerms.term </name>
      </tupleType>
      <inputOperator>
        <OperatorID> </OperatorID>
      </inputOperator>
      <DataResourceName>proteinTermsDataResource</DataResourceName>
      <GDSHandle>
        http://mach1.cs.man.ac.uk:8080/…/GridDataServiceFactoryP2R1/GDS1
      </GDSHandle>
      <predicateExpr>
        <predicate>
          <comparativeOperator>EQ</comparativeOperator>
          <leftOperand name="proteinTerms.term" type="tuplefield"/>
          <rightOperand name="GO:0008372" type="string"/>
        </predicate>
      </predicateExpr>
    </SEQ_SCAN>
  </Operator>
  ...
</Partition>

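An evaluator receiving such a sub-plan has to turn the predicate element into something it can apply to tuples. The sketch below does this for the predicate fragment from the example, using the standard library's xml.etree.ElementTree. The operator codes other than EQ, the treatment of non-tuplefield operands as literals, and the function name compile_predicate are all assumptions; this is not the GQES implementation.

```python
import operator
import xml.etree.ElementTree as ET

# Only EQ appears in the example; the other operator codes are assumptions.
COMPARATORS = {"EQ": operator.eq, "NE": operator.ne,
               "LT": operator.lt, "GT": operator.gt}

PREDICATE_XML = """
<predicateExpr>
  <predicate>
    <comparativeOperator>EQ</comparativeOperator>
    <leftOperand name="proteinTerms.term" type="tuplefield"/>
    <rightOperand name="GO:0008372" type="string"/>
  </predicate>
</predicateExpr>
"""


def compile_predicate(xml_text):
    """Turn a <predicateExpr> fragment into a function over tuples (dicts)."""
    predicate = ET.fromstring(xml_text).find("predicate")
    compare = COMPARATORS[predicate.findtext("comparativeOperator").strip()]

    def operand(element):
        # A "tuplefield" operand is looked up in the tuple; anything else
        # is treated as a literal carried in the name attribute.
        name = element.get("name")
        if element.get("type") == "tuplefield":
            return lambda row: row[name]
        return lambda row: name

    left = operand(predicate.find("leftOperand"))
    right = operand(predicate.find("rightOperand"))
    return lambda row: compare(left(row), right(row))


matches = compile_predicate(PREDICATE_XML)
rows = [{"proteinTerms.term": "GO:0008372"},
        {"proteinTerms.term": "GO:0005942"}]
selected = [row for row in rows if matches(row)]
```

Compiling the predicate once and reusing the resulting function keeps the XML parsing out of the per-tuple path of the scan.
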
Interactions of SB-DQP components
[Diagram: interactions numbered 1 to 8 among the Client, the GDS Registry (GDSR), the GDQS Factory (GDQSF), the GDQS, GDS instances and GQES 1 … GQES n. Labelled operations: registerService, findServiceData, createService, importSchema, findServiceData(DBSchema), perform(query), perform(querySubPlan), perform(gqes_query) and results.]
[Diagram: example execution across nodes N0 to N4. The Client sends perform(Query) to the GDQS, which uses GQES Factories (GQESF) with createService to create evaluators and sends each one perform(QuerySubplan). Operators shown on the evaluators: sequential_scan with reduce (proteinID, sequence); sequential_scan (term=8372) with reduce (proteinID); hash_join (p.proteinID=t.proteinID) on GQES 2; operation_call blast(p.sequence) with reduce (p.proteinID, blast), calling the BLAST Web Service. Results are returned to the Client.]

Summary
- DQP on the Grid provides:
  - The normal benefits of DQP.
  - Some added benefits from a Grid setting.
- The Grid specifically enables:
  - Runtime computational resource discovery.
  - Dynamic creation of remote evaluators.
  - Authentication/Transport services.
  - Access to non-database services.

Features of Our GDQS
- Low cost of entry:
  - Imports source descriptions through GDSs.
  - Imports service descriptions as WSDL.
- Throw-away GDQS:
  - Import sources on a task-specific basis.
  - Discard GDQS when task completed.
- Builds on parallel database technology:
  - Implicit parallelism.
  - Pipelined + partitioned parallel evaluation.
- Public release in July 2003.

The SB-DQP Team
- Manchester:
  - Nedim Alpdemir
  - Anastasios Gounaris
  - Alvaro Fernandes
  - Norman Paton
  - Rizos Sakellariou
- Newcastle:
  - Arijit Mukherjee
  - Jim Smith
  - Paul Watson