A Grid Data Integration Service (OGSA-DQP)
Paul Watson, University of Newcastle-upon-Tyne
based on the work of…
Norman Paton, Tasos Gounaris, Alvaro Fernandes, Rizos Sakellariou (University of Manchester)
Jim Smith, Arijit Mukherjee, Paul Watson (University of Newcastle-upon-Tyne)
www.neresc.ac.uk

The Problem
• Many grid applications would benefit from access to distributed data
• Data sources are scattered and autonomous
• Integration is often done by a tedious manual process
  • or (recently) by hand-coded workflows
• We are interested in how to simplify the process of querying distributed data
  • Focussing initially on information held in (relational) databases

Distributed Query Processing
• Queries are expressed in OQL
  • allows computations to be included in the query
• A single query may reference data at multiple sites
  • the data locations may be transparent to the query author

    select p.proteinId, Blast(p.sequence)
    from protein p, proteinTerm t
    where t.termId = 'S92' and p.proteinId = t.proteinId

Query Compiler
OGSA-DQP automatically compiles and executes the query on a set of Grid nodes, in parallel where possible.
[Figure: compiler components: OQL, Parser, Logical Optimiser, Physical Optimiser, Partitioner, Scheduler, Evaluator; grouped into a Single-node optimiser and a Multi-node optimiser]

Execution Plan

    select p.proteinId, Blast(p.sequence)
    from protein p, proteinTerm t
    where t.termId = 'S92' and p.proteinId = t.proteinId

• The plan is split into a set of partitions
• Grid resources are acquired to execute the partitions
  • in parallel where possible, required and affordable
[Figure: partitioned execution plan. Operators shown: table_scan (protein), table_scan termID=S92 (proteinTerm), reduce, exchange, hash_join (proteinId), op_call (Blast), reduce. Partitions are labelled 1, 2, 3-8 and 9,10]

Evaluation on the Grid
• OGSA-DQP builds on OGSA-DAI
  • accesses relational databases wrapped by OGSA-DAI
  • Oracle, DB2, MySQL
• Data streams between nodes
  • flow control
• All services are OGSI-compliant
  • built on GT3
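
The operators named on the Execution Plan slide can be mimicked in a single process to show how the example query is evaluated. The following is a minimal sketch in Python, not OGSA-DQP code: the sample rows, the `blast` stub and all function names are assumptions made for illustration, and the `exchange` operator (which moves data between partitions on different nodes) is left out because everything here runs locally.

```python
from collections import defaultdict

# Hypothetical sample data; in OGSA-DQP these tables live in remote databases.
PROTEIN = [
    {"proteinId": "P1", "sequence": "MKTAYIAK"},
    {"proteinId": "P2", "sequence": "GAVLIMCF"},
    {"proteinId": "P3", "sequence": "WYHKRQNE"},
]
PROTEIN_TERM = [
    {"proteinId": "P1", "termId": "S92"},
    {"proteinId": "P2", "termId": "T10"},
    {"proteinId": "P3", "termId": "S92"},
]


def table_scan(table, predicate=lambda row: True):
    """Yield the rows of a table that satisfy the predicate."""
    return (row for row in table if predicate(row))


def reduce_op(rows, columns):
    """Project each row onto the named columns (the plan's 'reduce')."""
    return ({c: row[c] for c in columns} for row in rows)


def hash_join(build_rows, probe_rows, key):
    """Join two row streams on a shared key using an in-memory hash table."""
    buckets = defaultdict(list)
    for row in build_rows:
        buckets[row[key]].append(row)
    for probe in probe_rows:
        for match in buckets.get(probe[key], []):
            yield {**match, **probe}


def blast(sequence):
    """Stand-in for the remote Blast operation; returns a placeholder string."""
    return f"blast({sequence})"


def op_call(rows, func, arg_column, out_column):
    """Apply func to one column of each row and store the result in a new column."""
    for row in rows:
        yield {**row, out_column: func(row[arg_column])}


def run_query():
    """Evaluate the example query bottom-up, following the plan's operator order."""
    terms = reduce_op(
        table_scan(PROTEIN_TERM, lambda r: r["termId"] == "S92"), ["proteinId"]
    )
    proteins = reduce_op(table_scan(PROTEIN), ["proteinId", "sequence"])
    joined = hash_join(terms, proteins, "proteinId")
    called = op_call(joined, blast, "sequence", "blast")
    return list(reduce_op(called, ["proteinId", "blast"]))


if __name__ == "__main__":
    for row in run_query():
        print(row)
```

Each operator consumes and produces a stream of rows, which is what lets a plan like this be cut into partitions and spread over several nodes: only the streams that cross a partition boundary need an exchange.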

Execution on the Grid
[Figure: query execution across Grid nodes N0 to N4. The Client sends perform(Query) to the GDQS; GQES factories (GQESF) handle createService requests, and each GQES receives perform(QuerySubplan). Sub-plans shown: sequential_scan with reduce (proteinID, sequence); sequential_scan (term=8372) with reduce (proteinID); hash_join (p.proteinID=t.proteinID); operation_call blast(p.sequence) against Web Services (BLAST) with reduce (p.proteinID, blast), shown on two nodes. Results flow back to the Client. Other labels: GDS, GDT]

Mutual Benefit
The Grid needs DQP:
• Declarative, high-level resource integration with implicit parallelism
• Cost-based optimisation
DQP needs the Grid:
• Systematic access to remote data and computational resources
• Dynamic resource discovery and allocation

Summary
• DQP is a potentially important technology for the Grid
• OGSA-DQP supports:
  • declarative expression of queries
  • location transparency
  • access to both data and computational resources
  • dynamic deployment on Grid resources
  • implicit parallelism
• First release made in September 2003
  • available for download
• Dynamic adaptation now being investigated
  • fault-tolerance, performance, cost

Experiences and Issues
• Remote service deployment not yet available for Grids, but some work…
• PhD Project at Newcastle (Chris Fowler)
  • dynamically deploy individual services remotely
  • initial prototype by end of November 2003
  • working on security issues
  • WS only
• GridShed project (Newcastle + BT)
  • design of hosting environments for Grids
  • install execution images on nodes as required

Experiences & Issues
• DQP vs Workflow?
  • for what space of problems is each better
• DQP advantages?
  • declarative expression of intent
  • cost-based choice of execution plans
  • implicit parallelisation
• Investigating with Bioinformatics applications in the myGrid project
  • DQP with workflows & workflows with DQP

Projects/Sponsors
Projects
• OGSA-DAI
• Polar
• Polar*
• myGrid
Sponsors