Slide 1: IBM TCG Symposium
Prof. Malcolm Atkinson, Director, National e-Science Centre (NeSC)
www.nesc.ac.uk
21st May 2003

Slide 2: Outline
- What is e-Science?
- UK e-Science
  - Roles and resources
  - Projects
- NeSC & e-Science Institute
  - Events, visitors, international programme
- Scientific data
  - Curation; data access & integration; data analysis & interpretation
- e-Science driving disruptive technology
  - Economic impact, mobile code, decomposition
  - Global infrastructure, optimisation & management
  - "Don't care where" computing

Slide 4: Foundation for e-Science
- e-Science methodologies will rapidly transform science, engineering, medicine and business
- Driven by exponential growth (×1000/decade)
- Enabling a whole-system approach: the Grid links computers, software, sensor nets, instruments, colleagues and shared data archives

Slide 5: Convergence & Ubiquity
- Multi-national, multi-discipline, computer-enabled consortia, cultures & societies
- Experiment & advanced data collection → shared data
- Theory, models & simulations → shared data
- Requires much computing science (engineering, systems, notations & formal foundation) and much innovation
- Changes culture: new mores, new behaviours → trust
- New opportunities, new results, new rewards

Slide 6: [Image slide: UCSF and UIUC; from Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign]

Slide 7: Global In-flight Engine Diagnostics
- Distributed Aircraft Maintenance Environment (DAME): Universities of Leeds, Oxford, Sheffield & York
- In-flight data passes from the aircraft to an airline ground station, over a global network (e.g. SITA) to the DS&S Engine Health Center, then on to the data centre and maintenance centre via internet, e-mail and pager
- 100,000 engines, 2-5 GB/flight, 5 flights/day = 2.5 PB/day

Slide 8: Tera → Peta Bytes (figures approximately correct, May 2003)

                        1 terabyte                1 petabyte
  RAM time to move      15 minutes                2 months
  1 Gb WAN move time    10 hours ($1,000)         14 months ($1 million)
  Disk cost             7 disks = £3,500 (SCSI)   6,800 disks + 490 units + 32 racks = £4.7 million
  Disk power            100 watts                 100 kilowatts
  Disk weight           5.6 kg                    33 tonnes
  Disk footprint        inside machine            60 m²
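These volumes reward a quick sanity check. The sketch below (Python; the 22% effective-utilisation figure is my assumption, chosen because it reproduces slide 8's quoted WAN times) confirms the 2.5 PB/day figure and shows why moving a petabyte across a wide-area network was measured in months:

```python
# Back-of-envelope check of the figures on slides 7 and 8.
# Assumptions (mine, not the slides'): 1 PB = 10**15 bytes, and an
# effective WAN throughput of ~22% of a nominal 1 Gb/s link, which is
# what makes slide 8's "10 hours per terabyte" figure come out.

engines = 100_000
flights_per_day = 5
gb_per_flight = 5                 # slide 7 gives 2-5 GB; take the top end

daily_bytes = engines * flights_per_day * gb_per_flight * 10**9
print(f"DAME daily volume: {daily_bytes / 10**15:.1f} PB")         # 2.5 PB

effective_bps = 0.22 * 10**9      # bits per second actually achieved
tb_seconds = (10**12 * 8) / effective_bps
pb_seconds = (10**15 * 8) / effective_bps
print(f"1 TB over the WAN: {tb_seconds / 3600:.0f} hours")         # ~10 hours
print(f"1 PB over the WAN: {pb_seconds / 86400 / 30:.0f} months")  # ~14 months
```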
Slide 10: UK 2000 Spending Review: e-Science and the Grid
"e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it."
"e-Science will change the dynamic of the way science is undertaken."
(John Taylor, Director General of Research Councils, Office of Science and Technology)

Slide 11: Additional UK e-Science Funding (from a presentation by Tony Hey)
- First phase, 2001-2004:
  - Application projects: £74M, covering all areas of science and engineering; >60 projects; 340 attendees at the first All Hands Meeting
  - Core programme: £35M; collaborative industrial projects with ~80 companies contributing >£30 million
- Second phase, 2003-2006:
  - Application projects: £96M, covering all areas of science and engineering
  - Core programme: £16M + £25M (?), for core Grid middleware

Slide 12: e-Science and SR2002 (2004-06 allocations, with 2001-04 in parentheses)

             MRC      BBSRC    NERC    EPSRC    HPC     Core Prog.      PPARC    ESRC     CLRC
  2004-06    £13.1M   £10.0M   £8.0M   £18.0M   £2.5M   £16.2M + ?      £31.6M   £10.6M   £5.0M
  2001-04    (£8M)    (£8M)    (£7M)   (£17M)   (£9M)   (£15M) + £20M   (£26M)   (£3M)    (£5M)

Slide 14: NeSC in the UK ("you are here")
- Centres: Edinburgh & Glasgow (NeSC), HPC(x), Newcastle, Belfast, Manchester, Daresbury Lab, Cambridge, Oxford, Hinxton, RAL, Cardiff, London, Southampton
- Helped build a community: Directors' Forum, Engineering Task Force, Grid Support Centre, Architecture Task Force (UK adoption of OGSA), OGSA Grid Market, Workflow Management, Database Task Force, OGSA-DAI, GGF DAIS-WG, GridNet, e-Storm
- e-SI programme: training, coordination, community building, workshops, pioneering

Slide 15: www.nesc.ac.uk [website screenshot]

Slide 16: UK Grid: Operational & Heterogeneous
- Currently a Level-2 Grid based on Globus Toolkit 2 (GT2)
- Transition to OGSI/OGSA will prove worthwhile, but issues remain: OGSA definition and delivery; hosting environments & platforms; combinations of services supported; material and grids to support adopters
- A schedule of transitions should be published (approximately & provisionally). Expected timeline:
  - Now: GT2 Level-2 service + GT3 middleware development & evaluation
  - Q3 & Q4 2003: GT2 Level 3 + GT3 Level 1
  - Q1 & Q2 2004: significant project transitions to GT3 Level 2/3
  - Late Q4 2004: most projects have transitioned; end of GT2 Level-3 service

Slide 18: e-Science Institute
- Past programme of events: planned six two-week research workshops per year; actually ran 48 events in the first 12 months!
- Highlights: GGF5, HPDC11 and a cluster of workshops; Protein Science, Neuroinformatics, …
- Major training events: Steve Tuecke on Grid & Globus (twice); Web Services, DiscoveryLink, relational DB design, …
- [Chart: attendance at GGF1-GGF7, scale 0-900]
- Clientele and outreach (year 1): >2,600 individuals from >500 organisations; 236 speakers; many participants return frequently
- Event diversity: conferences, summer schools, workshops, training, community building
- Company sponsorship & international engagement

Slide 20: Biology & Medicine
- Extensive research community: >1,000 researchers per research university
- Extensive applications that many people care about: health, food, environment
- Interacts with virtually every discipline: physics, chemistry, nanoengineering, …
- 450 databases relevant to bioinformatics: heterogeneity, interdependence, complexity, change, …
- Wonderful scientific questions: How does a cell work? How does a brain work? How does an organism develop? Why is the biosphere so stable? What happens to the biosphere when the earth warms up? …
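The heterogeneity point deserves unpacking: even joining two of those 450 databases means reconciling schemas and identifier conventions first. A minimal sketch of the problem, with entirely hypothetical records and field names:

```python
# Two hypothetical sources describing the same gene under different
# schemas and identifier conventions - a miniature of the heterogeneity
# across the ~450 bioinformatics databases mentioned on slide 20.
source_a = [{"gene_id": "BRCA2", "organism": "H. sapiens", "length_bp": 84193}]
source_b = [{"symbol": "brca2", "species": "Homo sapiens", "structures": ["1ABC"]}]

def join_key(record):
    """Reduce either source's identifier to a shared join key."""
    name = record.get("gene_id") or record.get("symbol")
    return name.upper()          # real integration needs approximate matching

# Integrate: index one source, probe it with the other.
by_key = {join_key(r): r for r in source_a}
for r in source_b:
    hit = by_key.get(join_key(r))
    if hit:
        print({**hit, **r})      # merged view of the gene
```

Upper-casing is the easy case; across real collections the "approximate matching" challenge flagged on the next slides applies to names, units and even species labels.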
Slide 21: Database Growth [chart: PDB content growth]

Slide 22: ODD-Genes [diagram: a problem-solving environment (PSE) using a registry and two database engines]

Slide 23: Scientific Data Opportunities
- Global production of published data: volume↑, diversity↑; combination ⇒ analysis ⇒ discovery
- Opportunities: specialised indexing, new data organisation, new algorithms, varied replication, shared annotation, intensive data & computation
- Practical challenges: data huggers, meagre metadata, ease of use, optimised integration, dependability
- Fundamental challenges: approximate matching, multi-scale optimisation, autonomous change, legacy structures, scale and longevity, privacy and mobility

Slide 24: Infrastructure Architecture: a distributed virtual integration architecture (layers, top to bottom)
- Data-intensive scientists
- Data-intensive applications for science: simulation, analysis & integration technology
- Generic virtual data access and integration layer
- OGSA services: job submission, brokering, registry, banking, data transport, workflow, structured data integration, authorisation, resource usage, transformation, structured data access
- OGSI: interface to Grid infrastructure
- Compute, data & storage resources; structured data (relational, XML, semi-structured)

Slide 25: Draft specification for GGF 7

Slide 27: Mohammed & Mountains
- Petabytes of data cannot be moved: they stay where they are produced or curated (hospitals, observatories, the European Bioinformatics Institute, …)
- Distributed collaborating communities, with expertise in curation, simulation & analysis
- Distributed & diverse data collections
- Discovery depends on insights, tested by combining data from many sources using sophisticated models & algorithms
- What can you do?

Slide 28: Move computation to the data
- Assumption: code size << data size
- Develop the database philosophy for this? Queries are dynamically re-organised & bound
- Develop the storage architecture for this? Compute on disk? (SoC + space on disk chips??)
- Safe hosting of arbitrary computation: proof-carrying code for data- and compute-intensive tasks, plus robust hosting environments
- Provision combined storage & compute resources
- Decomposition of applications, to ship behaviour-bounded sub-computations to the data (see the sketch after slide 31)
- Co-scheduling & co-optimisation of data & code movement and code execution; recovery and compensation

Slide 29: Software Changes
- Integrated problem-solving environments
- Users & application developers see an abstract computer and storage system: where and how things are executed can be ignored
- Diversity, detail, ownership, dependability and cost: explicit and visible
- Increasing sophistication of description: metadata for discovery; metadata for management and optimisation
- Applications developed dynamically by composition
- Mobile, safe & re-organisable code: predictable behaviour; decomposition & re-composition
- New programming languages & understanding needed

Slide 30: Organisational & Cultural Changes
- Access to computation & data must be simple: all use a computational, semantic, data-rich web
- Responsibility of data publishers: cost, dependability, trustworthiness, capability, flexibility, …
- Shared contributions compose indefinitely: knowledge accumulation and interdependence; contributor recognition and IPR
- Complexity and management of infrastructure: always on, must be sustained, paid for, and hidden

Slide 31: www.ogsadai.org.uk, www.nesc.ac.uk
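Before the DAI walkthrough, here is the toy sketch promised on slide 28: shipping a behaviour-bounded sub-computation to the data instead of pulling the data to the code. The Datastore class and the little declarative spec are my own illustration, not any project's API; a real system would add the proof-carrying-code checks and resource bounds the slide calls for:

```python
class Datastore:
    """Stands in for a remote, too-large-to-move data collection."""
    def __init__(self, rows):
        self._rows = rows                  # imagine petabytes here

    def fetch_all(self):
        # "Move the data": every row crosses the network.
        return list(self._rows)

    def run(self, spec):
        # "Move the computation": evaluate a bounded, declarative spec
        # next to the data, so the store can check and optimise it.
        field, op, value = spec["where"]
        assert op == ">", "bounded operator vocabulary"
        return sum(1 for r in self._rows if r[field] > value)

store = Datastore([{"flux": f} for f in (0.2, 3.1, 7.4, 0.9)])

# Pull the data, filter locally: transfers all four rows.
local = sum(1 for r in store.fetch_all() if r["flux"] > 1.0)

# Ship the query: transfers a few bytes each way.
remote = store.run({"where": ("flux", ">", 1.0)})

assert local == remote == 2
```

Because the shipped spec is data rather than arbitrary code, the store can re-organise and re-bind it dynamically, which is exactly the "database philosophy" bullet on slide 28.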
Slide 33: DAI Basic Services (service creation and API interactions, all over SOAP/HTTP)
- 1a. Client asks the Registry for sources of data about "x"
- 1b. Registry responds with a Factory handle
- 2a. Client asks the Factory for access to the database
- 2b. Factory creates a GridDataService (GDS) to manage access
- 2c. Factory returns the handle of the GDS to the client
- 3a. Client queries the GDS with XPath, SQL, etc.
- 3b. GDS interacts with the database (XML or relational)
- 3c. Results of the query are returned to the client as XML

Slide 34: DAIT Basic Services
- 1a. Client asks the Data Registry for sources of data about "x" & "y"
- 1b. Registry responds with a Factory handle
- 2a. Client asks the Factory for access to, and integration of, resources Sx and Sy
- 2b. Factory creates a network of GridDataServices (GDS1, GDS2, GDS3) linked by Grid Data Transport Services (GDTS1, GDTS2); Sx is an XML database, Sy a relational database
- 2c. Factory returns the handle of the GDS to the client
- 3a. Client submits a sequence of scripts, each containing a set of queries (XPath, SQL, etc.), to the GDS
- 3b. Client tells the analyst
- 3c. Sequences of result sets are returned to the analyst as formatted binary, described in a standard XML notation
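To make the numbered arrows concrete, here is slide 33's interaction pattern as client-side pseudocode in Python. Every class and method name is an illustrative stand-in, not the real OGSA-DAI interface, and each call would in practice be a SOAP/HTTP exchange:

```python
# Slide 33's registry -> factory -> GridDataService flow, as the client
# would see it. Names are illustrative stand-ins for the real (SOAP)
# service interfaces.

class GridDataService:
    def __init__(self, handle):
        self.handle = handle
    def perform(self, query):                  # 3a-3c: query in, XML out
        return f"<results query={query!r}>...</results>"

class Factory:
    def __init__(self, handle):
        self.handle = handle
    def create_gds(self):                      # 2a-2c: create and hand back a GDS
        return GridDataService(self.handle + "/gds-0")

class Registry:
    def find_sources(self, topic):             # 1a/1b: Factory handles for a topic
        return [Factory(f"http://example.org/factory/{topic}")]

registry = Registry()
factory = registry.find_sources("x")[0]        # 1a, 1b
gds = factory.create_gds()                     # 2a, 2b, 2c
print(gds.perform("SELECT * FROM proteins"))   # 3a, 3b, 3c
```

The DAIT pattern on slide 34 is the same flow, except that the factory wires up a network of GDSs and transport services so result sets can stream between sources before reaching the analyst.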