IBM UK & Ireland Technical Consultancy Group Prof. Malcolm Atkinson Director www.nesc.ac.uk 22nd May 2003 1 Outline What is e-Science? UK e-Science UK e-Science: Roles and Resources Scientific Data Curation Data Access & Integration Data Analysis & Interpretation e-Science driving Disruptive Technology Economic impact, Mobile Code, Decomposition Global infrastructure, optimisation & management “Don’t care where” computing 2 3 Foundation for e-Science e-Science methodologies will rapidly transform science, engineering, medicine and business driven by exponential growth (×1000/decade) enabling a whole-system approach computers software Grid sensor nets instruments colleagues Shared data archives 4 Convergence & Ubiquity Multi-national, Multi-discipline, Computer-enabled Consortia, Cultures & Societies Experiment & Advanced Data Collection → Shared Data Theory Models & Simulations → Shared Data Requires Much Computing Science Engineering, Systems, Notations & Much Innovation Formal Foundation Changes Culture, New Mores, New Behaviours → Trust New Opportunities, New Results, New Rewards 5 UCSF UIUC From Klaus Schulten, Center for Biomollecular Modeling and Bioinformatics, Urbana-Champaign 6 global in-flight engine diagnostics in-flight data airline ground station 100,000 engines 2-5 Gbytes/flight 5 flights/day = 2.5 petabytes/day global network eg SITA DS&S Engine Health Center internet, e-mail, pager maintenance centre data centre Distributed Aircraft Maintenance Environment: Universities of Leeds, Oxford, Sheffield &York 7 Tera → Peta Bytes RAM time to move 15 minutes 1Gb WAN move time 10 hours ($1000) Disk Cost RAM time to move 2 months 1Gb WAN move time 14 months ($1 million) Disk Cost 7 disks = $5000 (SCSI) Disk Power 100 Watts Disk Weight 5.6 Kg Disk Footprint Inside machine 6800 Disks + 490 units + 32 racks = $7 million Disk Power 100 Kilowatts Disk Weight 33 Tonnes Disk Footprint 60 m2 May 2003 Approximately Correct 8 See also Distributed Computing Economics Jim Gray, Microsoft Research, MSR-TR-2003-24 9 Additional UK e-Science Funding First Phase: 2001 –2004 Application Projects £74M All areas of science and engineering >60 Projects 340 at first All Hands Mtg Core Programme £35M Collaborative industrial projects ~80 Companies > £30 Million Second Phase: 2003 –2006 Application Projects £96M All areas of science and engineering Core Programme £16M + £25M (?) Core Grid Middleware 10 e-Science and SR2002 2004-6 Medical £13.1M Biological £10.0M Environmental £8.0M Eng & Phys £18.0M HPC £2.5M Core Prog. £16.2M + ? Particle Phys & Astro £31.6M Economic & Social £10.6M Central Labs £5.0M Research Council 2001-4 (£8M) (£8M) (£7M) (£17M) (£9M) (£15M) + £20M (£26M) (£3M) (£5M) 11 NeSC in the UK You are here HPC(x) Glasgow Edinburgh Directors’ Forum Newcastle Helped build a community Belfast Engineering Task Force Grid Support Centre Manchester Daresbury Lab Architecture Task Force UK Adoption of OGSA Cambridge OGSA Grid Market Oxford Workflow Management Hinxton RAL Database Task Force Cardiff OGSA-DAI London GGF DAIS-WG Southampton e-SI Programme training, coordination, community building, workshops, pioneering GridNet e-Storm 12 UK Grid: Operational & Heterogeneous Currently a Level-2 Grid based on Globus Toolkit 2 Transition to OGSI/OGSA will prove worthwhile There are still issues to be resolved OGSA definition / delivery Hosting environments & Platforms Combinations of Services supported Material and grids to support adopters A schedule of transitions should be (approximately & provisionally) published Expected time line Now GT2 L2 service + GT3 M/W development & evaluation Q3 & Q4 2003 GT2 L3 + GT3 L1 Q1 & Q2 2004 significant project transitions to GT3 L2/L3 Late Q4 2004 most projects have transitioned + end GT2 L3 13 14 Biology & Medicine Extensive Research Community >1000 per research university Extensive Applications Many people care about them Health, Food, Environment Interacts with virtually every discipline Physics, Chemistry, Nanoengineering, … 450 Databases relevant to bioinformatics Heterogeneity, Interdependence, Complexity, Change, … Wonderful Scientific Questions How does a cell work? How does a brain work? How does an organism develop? Why is the biosphere so stable? What happens to the biosphere when the earth warms up? … 15 Database Growth PDB Content Growth 39,856,567,747 16 Infrastructure Architecture Data Intensive X Scientists Data Intensive Applications for Science X Simulation, Analysis & Integration Technology for Science X Generic Virtual Data Access and Integration Layer Job Submission Brokering Registry Banking Data Transport Workflow Structured Data Integration Authorisation OGSA Resource Usage Transformation Structured Data Access OGSI: Interface to Grid Infrastructure Compute, Data & Storage Resources Structured Data Relational Distributed Virtual Integration Architecture XML Semi-structured 17 Data Access & Integration Services 1a. Request to Registry for sources of data about “x” SOAP/HTTP Registry 1b. Registry responds with Factory handle service creation API interactions 2a. Request to Factory for access to database Factory Client 2c. Factory returns handle of GDS to client 3a. Client queries GDS with XPath, SQL, etc 3c. Results of query returned to client as XML 2b. Factory creates GridDataService to manage access Grid Data Service XML / Relationa l database 3b. GDS interacts with database 18 Data Access and Integration Services 1a. Request to Registry for sources of data about “x” & “y” 1b. Registry responds with Factory handle Data Registry SOAP/HTTP service creation API interactions 2a. Request to Factory for access and integration from resources Sx and Sy Factory 2c. Factory returns handle of GDS to client 3b. Client Problem tells“scientific” Solving analyst Client Application Environment coding scientific insights Analyst 2b. Factory creates Semantic GridDataServices network Meta data 3a. Client submits sequence of scripts each has a set of queries to GDS with XPath, SQL, etc GDTS1 GDS GDTS XML database GDS2 Sx 3c. Sequences of result sets returned to analyst as formatted binary described in a standard XML notation GDS GDS1 Sy GDS3 GDS GDTS2 Relational database GDTS 19 ODD-Genes database engine 1 registry database engine 2 PSE 20 Scientific Data Opportunities Global Production of Published Data Volume Diversity Combination Analysis Discovery Opportunities Specialised Indexing New Data Organisation New Algorithms Varied Replication Shared Annotation Intensive Data & Computation Challenges Data Huggers Meagre metadata Ease of Use Optimised integration Dependability Challenges Fundamental Principles Approximate Matching Multi-scale optimisation Autonomous Change Legacy structures Scale and Longevity Privacy and Mobility 21 22 Mohammed & Mountains Petabytes of Data cannot be moved It stays where it is produced or curated Hospitals, observatories, European Bioinformatics Institute, … Distributed collaborating communities Expertise in curation, simulation & analysis Distributed & diverse data collections Discovery depends on insights Tested by combining data from many sources Using sophisticated models & algorithms What can you do? 23 Move computation to the data Assumption: code size << data size Develop the database philosophy for this? Queries are dynamically re-organised & bound Develop the storage architecture for this? Compute closer to disk? Dave Patterson SIGMOD 98 System on a Chip using free space in the on-disk controller Safe hosting of arbitrary computation Proof-carrying code for data and compute intensive tasks + robust hosting environments Provision combined storage & compute resources Decomposition of applications To ship behaviour-bounded sub-computations to data Co-scheduling & co-optimisation Data & Code (movement), Code execution Recovery and compensation 24 Software Changes Integrated Problem Solving Environments Users & application developers see Abstract computer and storage system Where and how things are executed can be ignored Diversity, detail, ownership, dependability, cost Explicit and visible Increasing sophistication of description Metadata for discovery Metadata for management and optimisation Raising the semantic level of discourse Applications developed dynamically by composition Mobile, Safe & Re-organisable Code Predictable & Guaranteed behaviour Decomposition & re-composition New programming languages & understanding needed 25 Organisational & Cultural Changes Access to Computation & Data must be simple All use a computational, semantic, data-rich web i.e. its invisible – the portal / browser lets you do more Responsibility of data publishers Cost, dependability, trustworthy, capable, flexibility, … Shared contributions compose indefinitely Knowledge accumulation and interdependence Contributor recognition and IPR Complexity and management of infrastructure Always on Must be sustained Paid for Hidden 26 www.ogsadai.org.uk www.nesc.ac.uk 27