UK e-Science: data, data everywhere, ...
Sunlabs, Boston
Malcolm Atkinson, Director
www.nesc.ac.uk
7th October 2005

Data, Data Everywhere & …
• Growing volumes
• Growing diversity
• Growing complexity
  ⇒ Rich resource
• How do we mine its riches for nuggets of information?
• Find & Access
• Understand
• Extract, Combine & Digest
  ⇒ Use OGSA-DAI
• Test hypotheses
• Bingo!

Story Line
• The UK e-Science Context
• Example Projects
  – Their demand for Data Integration
• OGSA-DAI
  – An Extensible Framework
  – A kit of parts
  – A way of life
• Summary & Take Home Messages

What is e-Science?
Goal: to enable better research in all disciplines

What is e-Science?
Method: Invention and exploitation of advanced computational methods
• to generate, curate and analyse research data
  – From experiments, observations and simulations
  – Quality management, preservation and reliable evidence
• to develop and explore models and simulations
  – Computation and data at extreme scales
  – Trustworthy, economic, timely and relevant results
• to enable dynamic distributed virtual organisations
  – Facilitating collaboration with information and resource sharing
  – Security, reliability, accountability, manageability and agility

Why use / build Grids?
• Research Arguments
  – Enables new ways of working
  – New distributed & collaborative research
  – Unprecedented scale and resources
• Economic Arguments
  – Reduced system management costs
  – Shared resources ⇒ better utilisation
  – Pooled resources ⇒ increased capacity
  – Load sharing & utility computing
  – Cheaper disaster recovery

Grids: a foundation for e-Science
• e-Science methodologies will rapidly transform science, engineering, medicine and business
  – driven by exponential growth (×1000/decade)
  – enabling a whole-system approach
Diagram (derived from Ian Foster’s slide): the Grid linking computers, software, sensor nets, instruments, colleagues and shared data archives

Why use / build Grids?
• Computer Science Arguments
  – New attempt at an old hard problem
  – Frustrating ignorance of existing results
  – New scale, new dynamics, new scope
• Engineering Arguments
  – Enable autonomous organisations to
    – Write complementary software components
    – Set up, run & use complementary services
    – Share operational responsibility

Why use / build Grids?
• Political & Management Arguments
  – Stimulate innovation
  – Promote intra-organisation collaboration
  – Promote inter-enterprise collaboration

What is e-Infrastructure – Political view
• A shared resource
  – That enables science, research, engineering, medicine, industry, …
  – It will improve UK / European / … productivity
    – Lisbon Accord 2000
    – E-Science Vision SR2000 – John Taylor
• Commitment by UK government
  – Sections 2.23-2.25
• Always there
  – c.f. telephones, transport, power

UK e-Science Budget (2001-2006)
Total: £213M + £100M via JISC

Research council    Budget    Share
MRC                 £21.1M    10%
BBSRC               £18M      8%
EPSRC               £77.7M    37%
NERC                £15M      7%
PPARC               £57.6M    27%
CLRC                £10M      5%
ESRC                £13.6M    6%

EPSRC Breakdown     Budget    Share
HPC                 £11.5M    15%
Applied             £35M      45%
Core                £31.2M    40%

Staff costs; Grid Resources, Computers & Network funded separately
+ Industrial Contributions £25M
Source: Science Budget 2003/4 – 2005/6, DTI(OST)
Slide from Steve Newhouse

The e-Science Centres
Map labels: Globus Alliance; National Centre for e-Social Science; Open Middleware Infrastructure Institute; e-Science Institute; Grid Operations Support Centre; Digital Curation Centre; CeSC (Cambridge); OMII-UK; EGEE; National Institute for Environmental e-Science

The Primary Requirement …
Building people grids
Enabling People to Work Together on Challenging Projects: Science, Engineering & Medicine

Service-Oriented Architecture
Diagram labels: Registry; Client; Service; Registration; Discovery; Invocation

Story so Far
• UK commits to invest
  – To stimulate academia & industry
• Differentiated by
  – Coordination
  – Strong leadership from Tony Hey
• Initially committed to US Grid M/W
• Invested in Web Services
• Focus on Data & Information Grids
  – DAI
  – Semantic Grids
• Application driven
  development

Story Line
• The UK e-Science Context
• Example Projects
  – Their demand for Data Integration
• OGSA-DAI
  – An Extensible Framework
  – A kit of parts
  – A way of life
• Summary & Take Home Messages

Motivation
• Entering an age of data
  – Data Explosion
  – CERN: LHC will generate 1 GB/s = 10 PB/y
  – VLBA (NRAO) generates 1 GB/s today
  – Pixar generates 100 TB/Movie
  – Storage getting cheaper
• Data stored in many different ways
  – Data resources
  – Relational databases
  – XML databases
  – Flat files
• Need ways to facilitate
  – Data discovery
  – Data access
  – Data integration
• Empower e-Business and e-Science
  – The Grid is a vehicle for achieving this

Composing Observations in Astronomy
Figure: No. & sizes of data sets as of mid-2002, grouped by wavelength
• 12 waveband coverage of large areas of the sky
• Total about 200 TB data
• Doubling every 12 months
• Largest catalogues near 1B objects
Data and images courtesy Alex Szalay, Johns Hopkins

Biomedical data – making connections
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg
© Slide provided by Carole Goble: University of Manchester

Database Growth
Figure: PDB Content Growth
Slide provided by Richard Baldock: MRC HGU Edinburgh

BRIDGES
Diagram labels: CFG Virtual Organisation; VO Authorisation; DATA HUB; Information Integrator; OGSA-DAI; Synteny Grid Service; blast; +
Publically Curated Data: Ensembl; OMIM; SWISS-PROT; MGI; RGD; HUGO; …
Private data at: Glasgow; Edinburgh; Leicester; Oxford; London; Netherlands
Slide provided by Richard Sinnott: University of Glasgow

eDiaMoND: Screening for Breast Cancer
Diagram labels: Patients; Screening; 1
Trust → Many Trusts; Collaborative Working; Audit capability; Epidemiology; Electronic Patient Records; Radiology reporting systems; Letters; Case Information; Assessment/Symptomatic; Biopsy; 2ndary Capture or FFD; X-Rays and Case Information; Other Modalities (MRI, PET, Ultrasound); Symptomatic/Assessment Information; eDiaMoND Grid; Case and Reading Information; Better access to case information and digital tools; SMF; CAD; Training; Case and Reading Information; Digital Reading; 3D Images; Supplement mentoring with access to digital training cases and sharing of information across clinics; Manage Training Cases; Perform Training; SMF; CAD; Temporal Comparison
Provided by eDiaMoND project: Prof. Sir Mike Brady et al.

Business Intelligence and Customer Data
Diagram labels: Customer Data; Customer Order Information; SAP; Oracle; DB2; Siebel; Data Grid; Customer Support; Web-based Dashboard; Marketing department identifies likely buyers of new product
• Company wants real-time integrated view of customer buying behavior
• Data resides in various distributed CRM & ERP systems
• Grid allows developers and apps to access and integrate customer data sources together in real time, across many distributed databases
Slide from: Dave Berry, Andrew Grimshaw – OGSA-WG

Providing Data to Cluster-Based Analytical Application
Diagram labels: R&D, West Coast; Testing, India; Engineering, East Coast; Headquarters, Illinois; Data Grid; Forward Proxy; Data Caches of Remote Data; Centralized Compute Cluster; Analytical Applications
• Company has centralized HPC cluster running compute-intensive applications
• Source data for analysis distributed among 3 global sites, one of them an external partner
• Manual data-sharing processes increase costs/errors, and hinder time-to-results
• Grid enables secure, automatic provisioning of remote data to HPC cluster, feeding CPUs more data faster
Slide from: Dave Berry, Andrew Grimshaw – OGSA-WG

Story so far
• There is a lot of data
  – growing in every dimension
  – Distributed
  – Many different producers & owners
  – Heterogeneous
  – High value resource
  – Combined it is more valuable
• There are many requirements for data integration
  – Takes many forms
  – Driven by insights
  – Enable conversion of insight into tested hypothesis
• There are many data owners
  – Their autonomy and policies must be respected
Generic Repeatable Solutions Required

Grid Construction Principles
• Dynamic coupling based on SOA
  – Respect Autonomy – local policies
  – Independent construction & provision
• Requires adherence to agreed protocols
  – Interoperability
  – Preferably with widely adopted standards
• Algorithms & Information structures
  – Must scale well
  – Must be tolerant to partial failure
  – Must be tolerant to partial system change
• Mechanisms to build trust are essential
• Much more than just providing services
  – Providing mutually consistent services with compatible policies

Story Line
• The UK e-Science Context
• Example Projects
  – Their demand for Data Integration
• OGSA-DAI
  – Architectural Context
  – An Extensible Framework
  – A kit of parts
  – A way of life
• Summary & Take Home Messages

Three communities
• Users: Individuals & Organisations
• Data, Information & Knowledge Providers
• Compute, Storage & Communications Providers

Three communities
Users: Individuals & Organisations
• Initiate & Steer work
• Provide Requirements
• Use knowledge & insight
• Want easy & reliable facility
• Expect agility, reliability, stability & latest techniques
• Pay the bills

Three communities
Compute, Storage & Communications Providers
• Provide & operate resources
  – Storage centres, Data centres, DBMS & File Systems
  – Computation environment, Processing & Communications
• Need to change facilities & policies
• Prefer consolidated requirements
• Must be paid

Three communities
Data, Information & Knowledge Providers
• Create & Collect data
• Organise & Structure data
• Provide, organise & maintain metadata
• Offer access and domain specific services
• Establish use policies
• Will change structure, services & policies
• May pay or be paid

Three communities
OGSA-DAI: a framework for a lasting & productive relationship

OGSA-DAI In One Slide
• An extensible framework for data access and integration.
• Expose heterogeneous data resources to a grid through web services.
• Interact with data resources:
  – Queries and updates.
  – Data transformation / compression
  – Data delivery.
• Customise for your project using
  – Additional Activities
  – Client Toolkit APIs
  – Data Resource handlers
• A base for higher-level services
  – federation, mining, visualisation,…
Slide from Neil Chue Hong

Connecting three communities
Diagram labels: Users: Individuals & Organisations; Portals & Applications; Client Library; OGSA-DAI; Data, Information & Knowledge Providers; Compute, Storage & Communications Providers
Diagonal labels (partly legible): Adaptive Interfaces; Adaptive & Extensible Interfaces & Services

International Collaboration & Use
USA: Globus Alliance, IBM Corporation, caBIG, BIRN, Indiana University, GridSphere, GEON, LEAD, MCS, NCSA, Secure Data Grid, UNC
UK: OMII, OMII-UK, NGS, NCeSS, NIeeS, AstroGrid, BioSimGrid, BRIDGES, CancerGrid, ConvertGrid, eDiaMoND, EDINA, First Group plc, Fujitsu Labs Europe, GEDDM, GeneGrid, Genomic Technology and Informatics, GOLD, Human Genetics Unit, IBM UK, myGrid, Oracle UK
Europe: CERN, DataMiningGrid, GridMiner, GridSphere, inteligrid, N2Grid, OntoGrid, Provenance, SIMDAT, OMII-EU
China: CAS, ChinaGrid, cnGrid, INWA, OMII-China
Japan: AIST, BioGrid, NAREGI
South Korea: KISTI
Australia: Curtin Business School, INWA
Tutorials: Boston, Cambridge, CERN, Chicago, Edinburgh, London, San Francisco, Seattle, Seoul, Singapore, Tokyo, ISSGC 03 to 05
DIALOGUE workshops: Chicago, Columbus, Edinburgh, Indiana, Manchester, San Diego, Vienna
Pie chart by country: China 40%; Others 20%; United Kingdom 15%; United States 11%; Japan 5%; France 3%; Germany 3%; Italy 2%; Austria 1%
1485 registered users
5250+ downloads

Meeting User Requirements
Projects shown: FirstDIG; ConvertGrid; eDiaMoND; GeneGrid; BRIDGES; LEAD; Grid Miner; OGSA WebDB; OGSA-DQP; caBIG

OGSA-DAI team
• EPCC Team, Edinburgh
• NeSC, Edinburgh
• NEReSC, Newcastle
• ESNW, Manchester
• IBM Development Team, Hursley
• IBM Dissemination Team

Inside a Data Service (1/2)
Diagram: Perform document → Data Service → Response document
• Extensible document-based interface:
  – Change in behaviour ≠> interface change.
  – Reduce number of operation calls.
• Perform document:
  – Data resource-related activities – query, update, create, compress, transform, deliver.
  – Data flows between activities.
• Response documents:
  – Status of executed activities, result data.

DAIS Data Access
Diagram labels:
• Consumer
• Database Data Service
• SQLAccess: SQLExecute ( SQLExpression ), SQLResponse
• SQLDescription: Readable, Writeable, ConcurrentAccess, TransactionInitiation, TransactionIsolation, etc.
• Relational Database

DAIS Derived Data Access
Diagram labels:
• Consumer
• Database Data Service
• SQLExecuteFactory ( SQLExpression, BehaviouralProperties )
• SQLDescription: Readable, Writeable, ConcurrentAccess, TransactionInitiation, TransactionIsolation, etc.
• Relational Database
• SQLFactory
• Reference to SQL Response Data Service
• RDBMS specific mechanism for generating result set
• SQL Response Data Service
• GetRowset ( rowsetnumber )
• SQLResponseDescription
• RowSet
• SQLResponseAccess
• Rowset

Inside a Data Service (2/2)
Diagram labels:
• Perform document; Response document
• Engine
• Query Activity (query); Transform Activity; Delivery Activity, linked by data flows
• DataResource Mediator; Role Mapper, with arrows labelled credentials, connection and role

OGSA-DAI Deck of Activities
[Figure not recoverable]
Slide from Mario Antonioletti

OGSA-DAI Architecture
Diagram labels: Tx Request; External Standard Coordination; Tx Response

Disciplines organise specialised extension
Diagram (three communities connected through OGSA-DAI, with Portals & Applications, Client Library and Activities), annotated with:
• Medical image specialisation …
• Astro specialisation …

Providing stability
Diagram (three communities connected through OGSA-DAI), annotated with:
• Stable interfaces for users
• Stable interfaces for application developers
• Stable interfaces for activity developers
• Stable interfaces for data providers
• Stable interfaces for resource providers

Extending capability & interfaces
Diagram (three communities connected through OGSA-DAI), annotated with:
• Add new operations, more abstraction & composition tools
• Add new functions & higher abstraction
• Exploit new capability offered by providers
• Expose new capability offered by data providers
• user & developer communities
• Always preserving interfaces exposed to data, for reasonable negotiated periods

Path towards improvement
Diagram (three communities connected through OGSA-DAI), annotated with:
• Drive abstraction & composition from metadata
• Exploit metadata
• Automate adaptation
• Optimise parallel, resilient & pipelined evaluators
• Control system & extension behaviour
• Negotiate SLAs with providers & obtain better description of resources
• Automate adaptation to sustain delivered functions & extension
• Improve metadata offered by data providers

Story Line
• The UK e-Science Context
• Example Projects
  – Their demand for Data Integration
• OGSA-DAI
  – Architectural Context
  – An Extensible Framework
  – A kit of parts
  – A way of life
• Summary & Take Home Messages

Challenges versus Goals

Challenge                         Goal
Heterogeneity & Variety           Simple operational model
Complex platform behaviour        Simple application model
Partial failures                  Simple user model
Partial failures + large tasks    Minimal resource wastage
Autonomy – owner’s rights         Stability & uniformity
Independent provision             Simple resource access
Scale, costs & latency            Good performance
Vulnerable to misuse              Dependable protection
Diverse & Evolving                Flexible & agile
Valuable assets                   IPR & assets well protected

Valuable assets: reputation, equipment, teams, data, algorithms, working practices

Data Services: challenges
• Scale
  – Many
    sites, large collections, many uses
• Longevity
  – Research requirements outlive technical decisions
• Diversity
  – No “one size fits all” solutions will work
  – Primary Data, Data Products, Meta Data, Administrative data, …
• Many Data Resources
  – Independently owned & managed
    – No common goals
    – No common design
    – Work hard for agreements on foundation types and ontologies
    – Autonomous decisions change data, structure, policy, …
  – Geographically distributed
• and I haven’t even mentioned security yet!
Slide from Neil Chue Hong

Core features of OGSA-DAI – I
• A framework for building applications
  – Supports data access, insert and update
    – Relational: MySQL, Oracle, DB2, SQL Server, Postgres
    – XML: Xindice, eXist
    – Files – CSV, BinX, EMBL, OMIM, SWISSPROT,…
  – Supports data delivery
    – SOAP over HTTP
    – FTP; GridFTP
    – E-mail
    – Inter-service
  – Supports data transformation
    – XSLT
    – ZIP; GZIP
  – Supports security
    – X.509 certificate based security
Slide from Neil Chue Hong

Core features of OGSA-DAI – II
• A framework for building data clients
  – Client toolkit library for application developers
• A framework for developing functionality
  – Extend existing activities, or implement your own
  – Mix and match activities to provide functionality you need
• Highly-extensible
  – Customise our out-of-the-box product
  – Provide your own services, client-side support and data-related functionality
• Comprehensive documentation and tutorials
• Latest release supports GT3.2 (to be deprecated), GT4.0, and Axis 1.2 / OMII_2 using Java 1.4
Slide from Neil Chue Hong

E-Infrastructure: the path to Nirvana 1
Diagram labels:
• Users: Individuals & Organisations
• Stability Extended with Improved Abstractions
• User Adaptation Layer
• Integration, Sharing, Trust building, Resource Managing
• Distributed System Infrastructure
• Data, Information & Knowledge Providers
• Compute, Storage & Communications Providers
• New technology
• Discipline Specific Toolsets, Languages & Portals
• User Mobility

E-Infrastructure: the path to Nirvana
2
Diagram labels:
• Users: Individuals & Organisations
• Stability + Higher-order Operations + Improved (semantic) Meta data
• User Adaptation Layer
• Integration, Sharing, Trust building, Resource Managing
• Distributed System Infrastructure
• Compute, Storage & Communications Providers
• Data, Information & Knowledge Providers

E-Infrastructure: the path to Nirvana 3
Diagram labels:
• Users: Individuals & Organisations
• Increasing Self-organisation & Autonomic Behaviour
• Data, Information & Knowledge Providers
• User Adaptation Layer
• Integration, Sharing, Trust building, Resource Managing
• Distributed System Infrastructure
• Automated Generation of Stability Adapters & Translators
• Compute, Storage & Communications Providers

Summary: Take home message
• E-Infrastructure is arriving
• Built on Grids & Web Services
• Data and Information grow in importance
• There is a dramatic rate of change
• An opportunity for everyone

Distributed Systems to Grids
Timeline, 1960 to 2000:
• Bespoke pioneering distributed systems, e.g. SABRE & SAGE
• Licklider proposes shared multi-site computing
• ARPA net
• IBM CICS
• Ethernet
• TCP
• Dozens of academic networks
• Two dominant protocols: CORBA & DCOM
• IP-based Internet
• Academic & Research WWW

Distributed Systems to Grids
Same timeline, 1960 to 2000, now also showing:
• I-way
• Condor
• Globus Starts
• Unicore
• Web Services
• Collaboration via shared bio/chem/medical “DBs”
• Many research grids, using many protocols & M/W stacks
• D-Grid starts
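
Sketch: the document-based interface
The "Inside a Data Service" slides describe one call that carries a perform document (a chain of activities such as query, transform and deliver, with data flowing between them) and returns a response document (status of each activity plus result data). The Python below is only a minimal sketch of that idea under stated assumptions. Every name in it (`perform`, `ACTIVITIES`, the dictionary layout of both documents) is hypothetical and is not the OGSA-DAI API; an in-memory SQLite table stands in for a data resource, CSV rendering stands in for XSLT-style transformation, and zlib stands in for the ZIP/GZIP activities.

```python
import sqlite3
import zlib


def query_activity(conn, params, _data):
    """Run an SQL query against the data resource and return its rows."""
    return conn.execute(params["sql"]).fetchall()


def transform_activity(_conn, _params, data):
    """Render rows as CSV text (a stand-in for a transformation activity)."""
    return "\n".join(",".join(str(value) for value in row) for row in data)


def compress_activity(_conn, _params, data):
    """Compress text (a stand-in for the ZIP/GZIP activities)."""
    return zlib.compress(data.encode("utf-8"))


# Hypothetical activity registry: new behaviour is added here,
# while the signature of perform() stays the same.
ACTIVITIES = {
    "query": query_activity,
    "transform": transform_activity,
    "compress": compress_activity,
}


def perform(conn, perform_document):
    """Execute all activities in one call and build a response document."""
    response = {"status": [], "result": None}
    data = None
    for step in perform_document:
        name = step["activity"]
        try:
            # Output of one activity flows into the next one.
            data = ACTIVITIES[name](conn, step.get("params", {}), data)
            response["status"].append((name, "COMPLETED"))
        except Exception as exc:
            # Stop at the first failing activity and report it.
            response["status"].append((name, f"ERROR: {exc}"))
            return response
    response["result"] = data
    return response


def demo():
    """Build a toy data resource and run a three-activity perform document."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (id INTEGER, label TEXT)")
    conn.executemany("INSERT INTO samples VALUES (?, ?)",
                     [(1, "alpha"), (2, "beta")])
    perform_document = [
        {"activity": "query",
         "params": {"sql": "SELECT id, label FROM samples ORDER BY id"}},
        {"activity": "transform"},
        {"activity": "compress"},
    ]
    return perform(conn, perform_document)
```

Calling `demo()` runs all three activities in a single request, which is the "reduce number of operation calls" point from the slide. Registering another entry in `ACTIVITIES` changes what the service can do without changing how `perform` is called, which is one way to read "change in behaviour ≠> interface change".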