EdSkyQuery-G Overview Brian Hills, December 2004 www.edikt.org Contents Edikt Motivation & Aims Architecture Current Status Results Future Outlook Edikt “e-Science Data, Information and Knowledge Transformation” – Bridge the gap between applications and computer science: Produce robust tools… …for real application science problems… …test them under extreme science conditions… …and keep an eye on the commercial possibilities. – Projects which may be of interest to astronomers: BinX, Eldas & EdSkyquery-G. – Visit: www.edikt.org Astronomy Requirements Sky Surveys collect masses of data to be managed: – For example, Sloan Digital Sky Survey: 15TB. – 2 * 10GB databases to be used for the project. – Edikt will have access to a 155TB SAN. Further research by leveraging data from different surveys: – Must identify same object from different catalogues. Require a “federated” view: – Data is distributed, homogeneous, large scale. – Building one big data warehouse isn’t feasible. – Interoperable services to combine disparate data sources. Is the middleware up to the task? EdSkyQuery-G: Motivation & Aims Support the “Open SkyQuery Initiative” – Move from a .NET-specific implementation. – Enable similar functionality on other platforms. Extensible framework for e-science: – – – – Handle heterogeneous archives. ‘Plug in’ algorithms e.g. Nearest Neighbour. Interact with Astrogrid components & VO. Leveraged for the BRIDGES project to perform simple joins. Apply Eldas to large scale E-Science problems: – Test: functionality, scalability & performance. Cross team collaboration: – Dr. Bob Mann (UoE, ROE), Edikt, EPCC, NeSC. SkyQuery: High Level Architecture 1. User submits query. Client 9. Returns results for display. 2. Client sends query to Portal. Portal 8. Performs crossmatch and returns final set of results. 3. Portal invokes SkyNode with an execution plan. SkyNode 1 7. Performs crossmatch and returns results. 4. Recursive call to next SkyNode in the execution plan. SkyNode 2 6. Runs query and returns results. 5. Recursive call to last SkyNode in the execution plan. SkyNode 3 EdSkyQuery-G: Architecture Inspired by Greg Riccardi’s paper for DAIS-WG: – http://www.cs.fsu.edu/~riccardi/grid/skyquery.pdf Discusses two approaches: 1. Access recipes for service interactions. 2. Retained state for service interactions. Potential benefits of #2: – – – Scalability Robustness Usability EdSkyQuery-G: Architecture 1. User submits query. Query Builder 9. Client may pull results back from SkyNode. 2. Query Builder sends query to Service Manager (SM). 8. SM returns a handle to the results. 5. SM invokes SkyNode 1 with an execution plan + results handle. Service Manager 7. Handle to results. 3. SM invokes SkyNode 1 with an execution plan. 4. Handle to results. SkyNode 2 SkyNode 1 DB2 - Linux DB2 - XP 6. Transfer results file. EdSkyQuery-G: Service Manager Client Uses Service Client Uses Service getMetadata doQuery JBOSS Grid Service Layer Generic Parser Distributed Query Processor Split Optimize Call to Query service Service Manager Execute Call to Xmatch service Call to Info service EdSkyQuery-G: SkyNode Architecture Invoked via the Service Manager Query XMatch Info EldasSkyNode Grid Services SkyNode Query & Stored Proc invocations DB2 EdSkyQuery-G: Stored Procedures Export Transport Import EDR Export Export SSA Transport Results: .csv .xml Import Transport Results: .csv .xml Export Results: .csv .xml Current Status Pre-Alpha (internal release), November ‘04. End-to-end invocation of all components: – Client->ServiceManager->SkyNode->Database 2 * 10GB DB2 test databases: – SSA from SuperCosmos, hosted by NeSC. – EDR from SDSS, hosted by EPCC. Limitations: – – – – SM: No query parser/splitting. Simple cross database join: not yet using XMatch. No data transport, other than JDBC between databases. Final results reside on database server. EdSkyQuery-G: Pre-Alpha Stored Procedures Export Import Export JDBC Query EDR SSA Results Import Results: .csv .xml Export Results: .csv .xml Results Tested with 3 queries from ROE cookbook: – http://surveys.roe.ac.uk/ssa/sqlcookbook.html – Queries #16, #17, #19. Results show: – Exporting data is quick. – Importing data is >10 * slower: Should we use native DB calls rather than JDBC for import only? – Queries slow: Need more indexes and database tuning? Query No. #Rows selected Export Time (secs) Import Time (secs) Join Query Time (secs) Total Time (secs) 16 488,718 130 1585 1110 2825 17 383,672 96 1138 1470 2704 19 4,667 82 15 1474 1571 Deliverables Software (Internal): – – – – Prototype client & GUI client. Service Manager. Eldas + Skynode Interface. Database stored procedures (Java). Documentation (some on NescForge): – Use Cases, Requirements, Design. – Performance Testing, Installation. Papers – AHM 04 Poster & Paper: http://www.allhands.org.uk/proceedings/papers/123.pdf – ADASS 04 Paper. Short Term Focus Enable science: – Compare different cross match algorithms. – Incorporate XMatch, CSIRO, ROE as stored procedures. Data Transfer – SCP: – Between databases. – Pull results back to client. – Deliver to a third party. Client & Service Manager enhancements: – Handle broader range of queries. Performance: – Compare with pre-alpha benchmarks. Improve test infrastructure. Longer Term Focus: AstroGrid MySpace Parser Query Builder Client Service Manager Registry SkyNode 2 SkyNode 1 DB2 - Linux DB2 - XP Longer Term Focus Interaction with Astrogrid components: – Clients, Registry, ADQL Parser, MySpace. OpenSkyQuery: – Compliance with interfaces. – Test with other DBMS and catalogues. Query Builder GUI/Data Integration tool: – Lead user through choosing fields from different datasets. GridFTP: – Currently unsuitable…. – NeSC course in Jan 05 may reveal more. Thank you Questions?