Data Challenges in e-Science
Aberdeen, 2nd December 2003
Prof. Malcolm Atkinson, Director
www.nesc.ac.uk

What is e-Science?

Foundation for e-Science
• e-Science methodologies will rapidly transform science, engineering, medicine and business
• Driven by exponential growth (×1000/decade)
• Enabling a whole-system approach: computers, software, sensor nets, instruments, colleagues and shared data archives linked by the Grid
(Diagram derived from Ian Foster's slide)

Three-way Alliance
Multi-national, multi-discipline, computer-enabled consortia, cultures & societies
• Theory, models & simulations → shared data
• Experiment & advanced data collection → shared data
• Requires much computing science: engineering, systems, notations, formal foundations & much innovation
• Changes culture: new mores, new behaviours → process & trust
• New opportunities, new results, new rewards

Biochemical Pathway Simulator
(Computing Science, Bioinformatics, Beatson Cancer Research Labs)
• Closing the information loop – between lab and computational model
• DTI Bioscience Beacon Project, Harnessing Genomics Programme
(Slide from Professor Muffy Calder, Glasgow)

Why is Data Important?

Data as Evidence – all disciplines
• Hypothesis, curiosity or … → collections of data → analysis & models → derived & synthesised data → information, knowledge, decisions & designs
• Driven by creativity, imagination and perspiration

Data as Evidence – Historically
• An individual's idea, a personal collection, personal effort, a lab notebook
• Hypothesis or curiosity → collections of data → analysis & models → derived & synthesised data → information, knowledge, decisions & designs
• Driven by creativity, imagination, perspiration & personal resources

Data as Evidence – Enterprise Scale
• Enterprise databases & (archive) file stores; data production pipelines
• Agreed hypotheses or business goals → collections of digital data → analysis & computational models → derived & synthesised data → data products & results → information, knowledge, decisions & designs
• Driven by creativity, imagination, perspiration & the company's resources

Data as Evidence – e-Science
• Multi-enterprise & public curation; communities and challenges
• Shared goals and multiple hypotheses → collections of published & private data → synthesis from multiple sources → multi-enterprise models, computation & workflow (analysis, computation, annotation) → derived & synthesised data → shared data products & results → information, knowledge, decisions & designs
• Driven by creativity, imagination, perspiration & shared resources

Global In-flight Engine Diagnostics
• 100,000 aircraft, 0.5 GB of in-flight data per flight, 4 flights/day → 200 TB/day
• In-flight data flows via an airline ground station and a global network (e.g. SITA) to the DS&S Engine Health Center, and on to the data centre and maintenance centre by internet, e-mail and pager
• Distributed Aircraft Maintenance Environment: Universities of Leeds, Oxford, Sheffield & York
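The 200 TB/day figure follows directly from the fleet-wide arithmetic. A minimal sketch, using only the round numbers quoted on the slide:

    # Back-of-envelope check of the in-flight diagnostics data volume,
    # using the slide's round figures (assumptions, not measurements).
    aircraft = 100_000           # fleet size assumed on the slide
    gb_per_flight = 0.5          # in-flight data per flight (GB)
    flights_per_day = 4

    tb_per_day = aircraft * gb_per_flight * flights_per_day / 1000   # GB -> TB
    print(f"~{tb_per_day:.0f} TB/day")                               # ~200 TB/day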
LHC Distributed Simulation & Analysis
(1 TIPS = 25,000 SpecInt95; a PC in 1999 ≈ 15 SpecInt95)
• Online system: one bunch crossing per 25 ns; ~100 triggers per second; each event is ~1 MByte; ~PBytes/sec off the detector, ~100 MBytes/sec into the offline farm (~20 TIPS)
• Tier 0: CERN Computer Centre, >20 TIPS
• Tier 1 (~Gbits/sec or air freight): regional centres – French, Italian, US, RAL – at ~1 TIPS each, with physics data caches
• Tier 2 (~Gbits/sec): Tier-2 centres such as ScotGRID++, ~1 TIPS each
• Tier 3 (100–1000 Mbits/sec): institutes at ~0.25 TIPS; Tier 4: workstations
• Physicists work on analysis "channels"; each institute has ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server

DataGrid Testbed
• >40 testbed sites (HEP and ESA sites): Dubna, Lund, Moscow, RAL, Estec KNMI, Berlin, IPSL Paris, Santander, Lisboa, CERN, Prague, Brno, Lyon, Grenoble, Milano, PD-LNL, Torino, Madrid, Marseille, Pisa, BO-CNAF, Barcelona, ESRIN, Roma, Valencia, Catania
(Francois.Etienne@in2p3.fr – Antonia.Ghiselli@cnaf.infn.it)

Multiple Overlapping Communities
• Each community has its own shared goals and multiple hypotheses, its own collections of published & private data, its own analysis, computation and annotation, and its own derived & synthesised data
• The communities overlap, and all are supported by common standards & shared infrastructure

Life-science Examples

Database Growth
• 45,356,382,990 bases in the sequence databases
• PDB content growth

Wellcome Trust: Cardiovascular Functional Genomics (BRIDGES)
• Partners: Glasgow, Edinburgh, Leicester, Oxford, London, the Netherlands; with IBM
• Combines shared project data with public curated data
• Depends on building & maintaining security, privacy & trust

Comparative Functional Genomics
(myGrid project: Carole Goble, University of Manchester)
• Large amounts of data
• Highly heterogeneous: data types, data forms, community
• Highly complex and inter-related
• Volatile

(UCSF and UIUC – from Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign)

Home Computers Evaluate AIDS Drugs
• Community = 1000s of home computer users + a philanthropic computing vendor (Entropia) + a research group (Scripps)
• Common goal = advance AIDS research
(From Steve Tuecke, 12 Oct 01)

Astronomy Examples

Global Knowledge Communities driven by Data: e.g., Astronomy
Number & sizes of data sets as of mid-2002, grouped by wavelength:
• 12-waveband coverage of large areas of the sky
• Total about 200 TB of data
• Doubling every 12 months
• Largest catalogues near 1B objects
(Data and images courtesy Alex Szalay, Johns Hopkins)

Sloan Digital Sky Survey Production System
(Slide from Ian Foster's SSDBM 03 keynote)

Supernova Cosmology Requires Complex, Widely Distributed Workflow Management

Engineering Examples

Whole-system Simulations
• Wing models: lift capabilities, drag capabilities, responsiveness
• Airframe models
• Stabilizer models: deflection capabilities, responsiveness
• Human models: crew capabilities – accuracy, perception, stamina, reaction times, SOPs
• Engine models: thrust performance, reverse thrust performance, responsiveness, fuel consumption
• Landing gear models: braking performance, steering capabilities, traction, dampening capabilities
(NASA Information Power Grid: coupling all sub-system simulations – slide from Bill Johnston)

Mathematicians Solve NUG30
• Looking for the solution to the NUG30 quadratic assignment problem
• An informal collaboration of mathematicians and computer scientists
• Condor-G delivered 3.46E8 CPU-seconds in 7 days (peak 1009 processors) at 8 sites in the U.S. and Italy
• Solution: 14, 5, 28, 24, 1, 3, 16, 15, 10, 9, 21, 2, 4, 29, 25, 22, 13, 26, 17, 30, 6, 20, 19, 8, 18, 7, 27, 12, 11, 23
• MetaNEOS: Argonne, Iowa, Northwestern, Wisconsin
(From Miron Livny, 7 Aug 01)
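For context, a quadratic assignment problem of this kind asks for the permutation of n facilities onto n locations that minimises the total flow-times-distance cost. In the standard Koopmans–Beckmann form (background, not taken from the slide):

    \min_{\pi \in S_n} \sum_{i=1}^{n} \sum_{j=1}^{n} f_{ij} \, d_{\pi(i)\pi(j)}

where f is the flow matrix, d the distance matrix, and n = 30 for NUG30. With 30! ≈ 2.65 × 10^32 candidate permutations, exhaustive search is hopeless, which is why the distributed search described above needed on the order of 3.46 × 10^8 CPU-seconds.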
Network for Earthquake Engineering Simulation
• NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers, & each other
• On-demand access to experiments, data streams, computing, archives, collaboration
• NEESgrid: Argonne, Michigan, NCSA, UIUC, USC
(From Steve Tuecke, 12 Oct 01)

National Airspace Simulation Environment
• 22,000 commercial US flights a day drive the Virtual National Air Space (VNAS)
• Sub-system simulations per day: 44,000 wing runs, 50,000 engine runs, 66,000 stabilizer runs, 48,000 human crew runs, 22,000 airframe impact runs, 132,000 landing/take-off gear runs – wing, engine, stabilizer, human, airframe and landing gear models run at GRC, ARC and LaRC
• Simulation drivers: FAA ops data, weather data, airline schedule data, digital flight data, radar tracks, terrain data, surface data
• NASA Information Power Grid: aircraft, flight paths, airport operations and the environment are combined to get a virtual national airspace

Data Challenges

It's Easy to Forget How Different 2003 is From 1993
• Enormous quantities of data: petabytes – for an increasing number of communities the gating step is not collection but analysis
• Ubiquitous Internet: >100 million hosts – collaboration & resource sharing are the norm; security and trust are crucial issues
• Ultra-high-speed networks: >10 Gb/s – global optical networks; the bottlenecks are the last kilometre & firewalls
• Huge quantities of computing: >100 Top/s – Moore's law gives us all supercomputers; organising their effective use is the challenge
• Moore's law everywhere: instruments, detectors, sensors, scanners, … – organising their effective use is the challenge
(Derived from Ian Foster's slide at SSDBM, July 03)

Tera → Peta Bytes (May 2003, approximately correct)

                        1 Terabyte                  1 Petabyte
RAM time to move        15 minutes                  2 months
1 Gb WAN move time      10 hours ($1,000)           14 months ($1 million)
Disk cost               7 disks = $5,000 (SCSI)     6,800 disks + 490 units + 32 racks = $7 million
Disk power              100 Watts                   100 Kilowatts
Disk weight             5.6 kg                      33 tonnes
Disk footprint          inside machine              60 m²

See also: Distributed Computing Economics, Jim Gray, Microsoft Research, MSR-TR-2003-24.
(A rough transfer-time check appears after the "Scientific Data" slide below.)

The Story so Far
• Technology enables Grids, more data & …; Information Grids will dominate
• Collaboration is essential: combining approaches, combining skills, sharing resources
• (Structured) data is the language of collaboration
• Data access & integration is a ubiquitous requirement
• Many hard technical challenges: scale, heterogeneity, distribution, dynamic variation
• Intimate combinations of data and computation, with unpredictable (autonomous) development of both

Scientific Data
Opportunities:
• Global production of published data: volume, diversity, combination, analysis, discovery
• Specialised indexing, new data organisation, new algorithms, varied replication, shared annotation, intensive data & computation
Challenges:
• Data huggers, meagre metadata, ease of use, optimised integration, dependability
• Fundamental principles, approximate matching, multi-scale optimisation, autonomous change, legacy structures, scale and longevity, privacy and mobility
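Returning to the Tera → Peta comparison: the WAN rows are easy to sanity-check. A minimal sketch of the idealised line-rate arithmetic (the slide's longer times reflect real-world effective throughput well below the nominal 1 Gb/s, plus the cost of the bandwidth):

    # Idealised transfer times over a 1 Gb/s WAN at full line rate, no overheads.
    # The slide's figures (10 hours/TB, 14 months/PB) assume much lower effective
    # throughput, so treat these numbers as lower bounds.
    link_bps = 1e9                               # nominal 1 Gb/s link, as on the slide

    def transfer_time_hours(size_bytes, bps=link_bps):
        return size_bytes * 8 / bps / 3600

    for label, size_bytes in [("1 TB", 1e12), ("1 PB", 1e15)]:
        hours = transfer_time_hours(size_bytes)
        print(f"{label}: {hours:,.0f} hours (~{hours / 24:,.1f} days)")
    # 1 TB: ~2 hours; 1 PB: ~2,222 hours (~93 days) even at full line rate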
UK e-Science

e-Science and the Grid
"e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it."
"e-Science will change the dynamic of the way science is undertaken."
– John Taylor, Director General of Research Councils, Office of Science and Technology
(From a presentation by Tony Hey)

e-Science Programme's Vision
• The UK will lead in the exploitation of e-Infrastructure
• New, faster and better research: engineering design, medical diagnosis, decision support, …
• e-Business, e-Research, e-Design & e-Decision
• Depends on leading e-Infrastructure development & deployment

e-Science and SR2002

Research Council          2001-04          2004-06
Medical                   (£8M)            £13.1M
Biological                (£8M)            £10.0M
Environmental             (£7M)            £8.0M
Eng & Phys                (£17M)           £18.0M
HPC                       (£9M)            £2.5M
Core Prog.                (£15M + £20M)    £16.2M
Particle Phys & Astro     (£26M)           £31.6M
Economic & Social         (£3M)            £10.6M
Central Labs              (£5M)            £5.0M

National e-Science Centre (www.nesc.ac.uk)
• National e-Science Institute (NeSI) in Edinburgh, with a training team
• Works with the Globus Alliance, international relationships, the Engineering Task Force, the Architecture Task Force, the Grid Support Centre, EGEE and HPC(x)
• OGSA-DAI – one of 11 Centre projects; GridNet, to support standards work – one of 6 administration projects
• 5 application projects; 15 "fundamental" research projects

NeSI Events held in the 2nd Year (1 Aug 2002 to 31 Jul 2003)
We have had 86 events and >3,600 participant days (Year 1 figures in brackets; the definitions are fuzzy!):
• 11 project meetings (4)
• 11 research meetings (7)
• 25 workshops (17 + 1)
• 2 "summer" schools (0)
• 15 training sessions (8)
• 12 outreach events (3)
• 5 international meetings (1)
• 5 e-Science management meetings (7)
Establishing a training team; investing in community building, skill generation & knowledge development. Suggestions always welcome.

NeSI Workshops
http://www.nesc.ac.uk/events/
• Space for real work; crossing communities; creativity – new strategies and solutions; written reports
• Topics have included: Scientific Data Mining, Integration and Visualisation; Grid Information Systems; Portals and Portlets; Virtual Observatory as a Data Grid; Imaging, Medical Analysis and Grid Environments; Grid Scheduling; Provenance & Workflow; GeoSciences & Scottish Bioinformatics Forum

E-Infrastructure

Infrastructure Architecture (Distributed Virtual Integration Architecture), from top to bottom:
• Data-intensive users and data-intensive applications for application area X
• Simulation, analysis & integration technology for application area X
• Generic virtual data access and integration layer
• OGSA services: job submission, brokering, registry, banking, data transport, workflow, structured data integration, authorisation, resource usage, transformation, structured data access
• OGSI: interface to Grid infrastructure
• Compute, data & storage resources; structured data – relational, XML, semi-structured

Data Access & Integration Services
Client, Registry, Factory and Grid Data Service (GDS) interact by SOAP/HTTP service creation and API interactions:
1a. Client asks the Registry for sources of data about "x"
1b. Registry responds with a Factory handle
2a. Client asks the Factory for access to the database
2b. Factory creates a GridDataService to manage access
2c. Factory returns the handle of the GDS to the client
3a. Client queries the GDS with XPath, SQL, etc.
3b. GDS interacts with the XML / relational database
3c. Results of the query are returned to the client as XML
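The registry → factory → data-service pattern above maps naturally onto client code. Below is a minimal, self-contained sketch of the interaction; the toy classes and method names are illustrative stand-ins, not the OGSA-DAI API, and a real deployment would exchange SOAP/HTTP messages between separate services:

    # Toy stand-ins for the Registry -> Factory -> Grid Data Service pattern above.
    # Names are hypothetical; this is NOT the OGSA-DAI API.

    class GridDataService:
        """Manages access to one database and answers queries (steps 3a-3c)."""
        def __init__(self, database):
            self.database = database                  # stand-in for an XML/relational DB

        def query(self, expression):
            rows = self.database.get(expression, [])  # 3b: GDS talks to the database
            return "<results>" + "".join(f"<row>{r}</row>" for r in rows) + "</results>"

    class Factory:
        """Creates Grid Data Services on request (steps 2a-2c)."""
        def __init__(self, database):
            self.database = database

        def create_gds(self):
            return GridDataService(self.database)     # 2b: create; 2c: hand back

    class Registry:
        """Maps topics to Factory handles (steps 1a-1b)."""
        def __init__(self, entries):
            self.entries = entries

        def find_factory(self, topic):
            return self.entries[topic]                # 1b: respond with Factory handle

    # Example run of the whole sequence
    db_x = {"SELECT name FROM genes": ["BRCA1", "TP53"]}   # hypothetical database content
    registry = Registry({"x": Factory(db_x)})

    factory = registry.find_factory("x")              # 1a/1b
    gds = factory.create_gds()                        # 2a-2c
    print(gds.query("SELECT name FROM genes"))        # 3a-3c: results returned as XML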
Future DAI Services?
A "scientific" analyst, working in a problem-solving environment with client application code that captures scientific insights, drives a richer version of the same pattern (SOAP/HTTP service creation and API interactions, supported by a semantic network of metadata):
1a. Request to the Data Registry for sources of data about "x" & "y"
1b. Registry responds with a Factory handle
2a. Request to the Factory (a Data Access & Integration master) for access to and integration of resources Sx and Sy
2b. Factory creates GridDataServices (GDS1, GDS2, GDS3) and GDTS (GDTS1, GDTS2) over the XML database (Sx) and the relational database (Sy)
2c. Factory returns the handle of the GDS to the client
3a. Client submits a sequence of scripts, each with a set of queries to the GDS, in XPath, SQL, etc.
3b. Client tells the analyst
3c. Sequences of result sets are returned to the analyst as formatted binary, described in a standard XML notation

Integration is our Focus

Supporting Collaboration
• Bring together disciplines
• Bring together people engaged in a shared challenge
• Inject initial energy
• Invent methods that work

Supporting Collaborative Research
• Integrate compute, storage and communications
• Deliver and sustain an integrated software stack
• Operate a dependable infrastructure service
• Integrate multiple data sources
• Integrate data and computation
• Integrate experiment with simulation
• Integrate visualisation and analysis
• High-level tools and automation are essential
• Fundamental research as a foundation

Take Home Message
• Data is a major source of challenges AND an enabler of new science, engineering, medicine, planning, …
• Information Grids: support for collaboration, support for computation and data grids; structured data is fundamental; integrated strategies & technologies are needed
• E-Infrastructure is here – more to do, technically & socio-economically
• Join in – explore the potential – develop the methods & standards
• NeSC would like to help you develop e-Science; we seek suggestions and collaboration