UK e-Science
Future Infrastructure for Scientific Data Mining, Integration and Visualisation

Malcolm Atkinson
Director of National e-Science Centre
www.nesc.ac.uk
25th October 2002
SDMIV workshop, e-Science Institute Edinburgh

Overview
- UK e-Science
  - Reminder of Investment and Infrastructure
- International e-Science
  - Examples and Collaboration
- Data Access and Integration
  - Lego Bricks for Scientific Application Developers
  - Tailored: Application and Computing Scientists
- A Computer Scientist’s Christmas List
- Diversity and Opportunity
  - The Way Ahead

e-Science
- Fundamentally about Collaboration
- Sharing
  - Ideas
  - Thought processes and Stimuli
  - Effort
  - Resources
- Requires
  - Communication
  - Common understanding & Framework
  - Mechanisms for sharing fairly
  - Organisation and Infrastructure
- Scientists (Biologists) have done this for Centuries

e-Science (take 2)
- Fundamentally about Collaboration
- Sharing
  - Ideas
  - Thought processes and Stimuli
  - Effort
  - Resources
- Requires
  - Communication
  - Common understanding & Framework
  - Mechanisms for sharing fairly
  - Organisation and Infrastructure
- Text, digital media, structured, organised & curated data, computable models, visualisation, shared instruments, shared systems, shared administration, …
- Nationally & Internationally Distributed, …
- Routine, Daily, Automated, …
- That Requires very Significant Investment in Digital Systems and their Support

e-Science (take 3)
- Fundamentally about Collaboration
- Sharing
  - Ideas
  - Thought processes and Stimuli
  - Effort
  - Resources
- Requires
  - Communication
  - Common understanding & Framework
  - Mechanisms for sharing fairly
  - Organisation and Infrastructure
- Digital networks, digital workplaces, digital instruments, …
- Metadata, ontologies, standards, shared curated data, shared codes, …
- Common platforms, shared software, shared training, …
- Citation, Authentication, Authorisation, Accounting, Provenance, Policies, …
- Shared Provision of Platform,

The Grid SHOULD make this much easier by providing a common, supported high level of Software and Organisational
infrastructure

Grid Expectations
- Persistence
  - Always there, Always Working, Always Supported
- Stability
  - You can build on foundations that don’t move
- Trustworthy & Predictable
  - Honours commitments
  - Digital policies, digital contracts, security, …
  - Data integrity, longevity and accessibility
- Performance
- High-level & Extensible
  - The capabilities you need are already there
- Ubiquitous
  - Your collaborators use it

Grid Reality
- Persistence
  - Always there, Always Working, Always Supported
  - Reality: Political, Economic & Technical issues to Solve
- Stability
  - You can build on foundations that don’t move
  - Reality: Early days but Open Grid Services link with Web Services + GGF standardisation
- Trustworthy & Predictable
  - Honours commitments
  - Digital policies, digital contracts, security, …
  - Data integrity, longevity and accessibility
  - Reality: Not yet but very substantial global effort to achieve this
- Performance
- High-level & Extensible
  - The capabilities you need are already there
  - Reality: Good basis for extension; Commitment to basic functionality; WS + Community effort
- Ubiquitous
  - Your collaborators use it
  - Reality: Global & Industrial Rallying Cry; Must work with Web Services

UK Grid Network
- National eScience Centre
- Access Grid always-on video walls
- HPC(x)
[Map: Edinburgh, Glasgow, Newcastle, Belfast, Daresbury Lab, Manchester, Cambridge, Hinxton, Oxford, Cardiff, RAL, London, Southampton]

SuperJanet4, June 2002
[Network map. Link speeds: 20Gbps, 10Gbps, 2.5Gbps, 622Mbps, 155Mbps. Nodes: Scotland via Glasgow, Scotland via Edinburgh, WorldCom Glasgow, WorldCom Edinburgh, NNW, NorMAN, YHMAN, Northern Ireland, MidMAN, WorldCom Manchester, WorldCom Leeds, EMMAN, WorldCom Reading, WorldCom London, EastNet, TVN, South Wales MAN, WorldCom Bristol, External Links, WorldCom Portsmouth, LMN, SWAN & BWEMAN, LeNSE, Kentish MAN]
Tony Hey, July 2001

National e-Science Centre
- Events
  - Workshops
  - Research Meetings
  - International Meetings
- History of Events
  - GGF5
  - HPDC11
  - Summer school
  - > 50 workshops held
  - > 1000 people in total
  - Many return often
- Planned Events
  - 25 workshops
  - Conferences to 2005
- Visitors
  - 3 arrived
  - 4 arranged
- International collaboration,
  visits & visitors
  - China
  - Argonne National Lab
  - SDSC
  - NCSA
  - …
- Centre Projects
- Pilot Projects
- Regional Support
- Research Projects
  - EPSRC, MRC, WT, SHEFC

[Image: UCSF, UIUC. From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign]

DataGrid Testbed
- Testbed Sites (>40)
- HEP sites
- ESA sites
[Map: Dubna, Lund, Moscow, RAL, Estec, KNMI, Berlin, IPSL, Paris, Santander, Lisboa, CERN, Prague, Brno, Lyon, Grenoble, Milano, PD-LNL, Torino, Madrid, Marseille, Pisa, BO-CNAF, Barcelona, ESRIN, Roma, Valencia, Catania]
Francois.Etienne@in2p3.fr - Antonia.Ghiselli@cnaf.infn.it

A Simplified Grid Anatomy
[Layered diagram]
- Scientific Application
- Monitoring, Diagnosis, Logging, Scheduling, Accounting, Authorisation
- Grid Plumbing & Security Infrastructure
- Data & Compute Resources (Distributed)
- Groups: Scientific Users, Application Developers, Operations Team, Owners

The Crux
[Layered diagram]
- Scientific Application
- Monitoring, Diagnosis, Logging, Authorisation, Scheduling, Accounting
- Grid Plumbing & Security Infrastructure
- Data & Compute Resources (Distributed)
- Groups: Scientific Users, Application Developers, Operations Team, Owners
- Keep all the (pink) groups HAPPY

A SDMIV Grid Anatomy
[Layered diagram]
- Scientific Application
- Monitoring, Diagnosis, Scheduling, Accounting, Logging, Authorisation
- Data Integration
- Data Access
- Grid Plumbing & Security Infrastructure
- Data & Compute Resources (Distributed), Structured Data
- Groups: SDMIV Users, Data Providers, Data Curators

Database Growth
[Chart: PDB protein structures]

Data Mining: Science vs Commerce
- Data in files
  - FTP a local copy / subset.
  - ASCII or Binary.
- Each scientist builds own analysis toolkit
- Analysis is tcl script of toolkit on local data.
- Some simple visualization tools: x vs y
- Data in a database
  - Standard reports for standard things.
  - Report writers for non-standard things
- GUI tools to explore data.
  - Decision trees
  - Clustering
  - Anomaly finders
Jim Gray, UCSC, April 2002

But… some science is hitting a wall
FTP and GREP are not adequate
- You can GREP 1 MB in a second
- You can GREP 1 GB in a minute
- You can GREP 1 TB in 2 days
- You can GREP 1 PB in 3 years.
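As a rough check on the GREP figures quoted above, the sketch below works out the sustained scan rate each one implies. It assumes decimal units (1 MB = 10^6 bytes) and a 365-day year, neither of which the slide states, and the function name is illustrative only.

```python
# Implied scan throughput for the GREP figures quoted on the slide.
# Assumptions (not from the slide): decimal units and a 365-day year.

SECONDS = {"second": 1, "minute": 60, "day": 86_400, "year": 365 * 86_400}

# (label, bytes scanned, elapsed seconds), taken from the slide text.
GREP_FIGURES = [
    ("1 MB in a second", 10**6, 1 * SECONDS["second"]),
    ("1 GB in a minute", 10**9, 1 * SECONDS["minute"]),
    ("1 TB in 2 days", 10**12, 2 * SECONDS["day"]),
    ("1 PB in 3 years", 10**15, 3 * SECONDS["year"]),
]


def implied_mb_per_second(num_bytes: int, seconds: int) -> float:
    """Return the sustained rate, in MB/s, needed to scan num_bytes in seconds."""
    return num_bytes / seconds / 10**6


if __name__ == "__main__":
    for label, num_bytes, seconds in GREP_FIGURES:
        rate = implied_mb_per_second(num_bytes, seconds)
        print(f"{label}: about {rate:.1f} MB/s")
```

Under these assumptions every figure corresponds to a rate between about 1 and 17 MB/s, so elapsed time grows roughly in proportion to data volume: a sequential scan does not get cheaper per byte as the archive grows.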
Oh!, and 1 PB ~ 10,000 disks
- You can FTP 1 MB in 1 sec
- You can FTP 1 GB / min (= 1 $/GB)
- … 2 days and 1K$
- … 3 years and 1M$
50,000 kg, 250 kW, 60 racks = 120 m²
At some point you need
- indices to limit search
- parallel data search and analysis
This is where databases can help
Jim Gray, UCSC, April 2002

OGSA & OGSI
[Diagram: Web Services, Grid Technology, Grid Services]
www.gridforum.org/ogsi-wg
www.gridforum.org/ogsa-wg
www.gridforum.org/

Web Services
- Rapid Integration
  - Dynamic binding
- Commercial Power
  - Financial & Political
- Independence
  - Client from Service
  - Service from Client
- Separation
  - Function from Delivery
- Description
  - WSDL, WSC, WSEF, …
- Tools & Platforms
  - Java ONE, Visual .NET
  - WebSphere, Oracle, …
www.w3c.org/TR/SOAP or TR/wsdl

Grid Technology
- Virtual Organisations
  - Sharing & Collaboration
- Security
  - Single Sign in, delegation
- Distribution & fast FTP
  - But Various Protocols
- Resource Management
  - Discovery
  - Process Creation
  - Scheduling
  - Monitoring
- Portability
  - Ubiquitous APIs & Modules
- Gov’nm’t Agency Buy in
- Industrial Buy in
Foster, I., Kesselman, C. and Tuecke, S., The Anatomy of the Grid: Enabling Virtual Organisations, Intl. J. Supercomputer Applications, 15(3), 2001
http://www.gridforum.org/ogsi-wg

Open Grid Services Architecture
- Applications
  - Using operations
- Virtual Grid Services
  - Implemented by
- Multiple implementations of Grid Services
- OGS infrastructure
Foster, I., Kesselman, C., Nick, J.
and Tuecke, S., The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration

Scientific Data
- Deluge of Data
  - Exponential growth
  - Doubling times
    - Astronomy: 12 months
    - Bio-Sequences: 9 months
    - Functional Genomics: 6 months
    - Bytes/dollar: 12 to 18 months
- Not How big it is but

Scientific Data
- Deluge of Data
  - Exponential growth
  - Doubling times
    - Astronomy: 12 months
    - Bio-Sequences: 9 months
    - Functional Genomics: 6 months
    - Bytes/dollar: 12 to 18 months
- Not How big it is but What you do with it
  - Sharing
  - Curation
  - Metadata
  - Automated movement, access & integration
  - Computational Access

Scientific Data
- Deluge of Data
  - Exponential growth
  - Doubling times
    - Astronomy: 12 months
    - Bio-Sequences: 9 months
    - Functional Genomics: 6 months
    - Bytes/dollar: 12 to 18 months
- Not How big it is but How you Embrace & Manage Change
  - The Database is a Knowledge chest
  - The Database is a Communication Hub
  - Autonomously Managed (Curated) change
  - An Essential part of e-BioMedical, Astronomical, …, Science & Engineering

Wellcome Trust: Cardiovascular Functional Genomics
[Diagram: Glasgow, Edinburgh, Leicester, Oxford, London, Netherlands; Shared data; Public curated data]

Data Access & Integration
- Central to e-Science
  - Astronomy, Earth Sciences, Ecology, Biology, Medicine, …
- Collaboration
  - Shared Databases
  - Curated Knowledge
  - Accumulated Observations
  - Accumulated Simulations
- Computation
  - Data mining
  - Input to models
  - Calibration of models
- Presentation
  - Publication of results
  - Visualisation

GGF DAIS WG
- Chairs
  - Norman Paton (Manchester Uni.)
  - Leanne Guy (CERN)
  - Dave Pearson (Oracle UK)
- Activity
  - BoF GGF4 Toronto
  - WG Meeting GGF5 Edinburgh
  - Papers for GGF6
  - Workshops & Mail lists
- Goals
  - Agree Standards for Database Access & Integration
  - Freely available reference implementations
  - OGSA-DAI one source & focus for discussions
Norman Paton, Inderpal Narang, Leanne Guy, Susan Malaika, Greg Riccardi, …
http://www.cs.man.ac.uk/grid-db/

OGSA-DAI project
- Lego kit for Data Access & Integration
  - Components for e-Science Applications
  - Accelerated Application Development
  - Multiple Data Models
  - Distributed Data Access via Grid & Proxies
  - Integration, Translation & Transformation
- Open Source Reference Implementation
  - For DAIS-WG standard
- Trigger for Component Construction
  - Start a community

OGSA-DAI Partners
[Map of UK sites: Glasgow, Newcastle, Belfast, Daresbury Lab, Manchester, Oxford, Cambridge, Hinxton, RAL, Cardiff, London, Southampton]
- Partners labelled: IBM USA, EPCC & NeSC, Oracle, IBM UK, IBM Hursley, Manchester e-SC, Newcastle e-SC
- £3 million, 18 months, started February 2002

Primary Components
[Diagram: GDSF, Client, GDS, DB, Consumer, GDSR]

Advanced Components
[Diagram: Translation, Client, GDS:PerformScript, GDS, DB, Translation, GDT, Consumer]

Composed Components
[Diagram: GDS:performScript, Translation, GDS:performScript, GDS, Client, GDS:performScript, GDT, Translation, GDS:performScript, GDT, GDT, Consumer]

Composing Components
[Diagram: Data Transport, OGSA-DAI Component, Data Transport, OGSA-DAI Component, Data Transport, OGSA-DAI Component, Data Transport]

DAI Key Components
- GridDataService (GDS): Access to data & DB operations
- GridDataServiceFactory (GDSF): Makes GDS & GDSF
- GridDataServiceRegistry (GDSR): Discovery of GDS(F) & Data
- GridDataTranslationService: Translates or Transforms Data
- GridDataTransportDepot (GDTD): Data transport with persistence
- Relational & XML models supported
- Role-based Authorisation
- Binary structured files

OGSA Relationship
Class GridService GDS Registry NotificationConsumer NotificationProducer Mandatory Optional Normal GDSF Mandatory Optional Normal GDSR Mandatory GDTS Mandatory GDTD Mandatory
Mandatory Normal Optional Normal

DAI portType Usage
Class GridDataService DataTransport
GDS Mandatory Normal
GDSF Optional Normal
GDSR Optional
GDTS Optional Mandatory
GDTD Optional Mandatory
Factory Mandatory

Distributed Query
[Diagram: numbered message flow (steps 1 to 8) among Client, Registry (R), Factory (F), GDS, DQP, Evaluators, GDT, GDTV, QPM, DB, NS and Consumer]
- DQP : Distributed Query Processor
- GDT : Grid Data Transport
- T : Translation
- Q : Query
- GDTV : Grid Data Transport Vehicle
- F : Factory
- QPM : Query Progress Monitor
- PNM : Progress Notification Message
- AM : Application Metadata
- CRM : Computational Resource Metadata
- NS : Notification Sink

OGSA-DAI Time Line
- WS + GSI UK support (> 100 downloads)
- XML + OGSA Prototypes for Early Adopters
- Design Documents & Demos for DAIS WG @ GGF5
- XML + OGSA Prototype Available
- RDB + GT2 / OGSA Prototypes Available
- GGF6 WG Papers & Prototypes
- Ship Alpha Release for GT3 Integration
- Presentation & Beta @ GGF7
- Productisation, RAMPS & Extension
[Time axis: Feb ’02, May ’02, Phase 1 Starts, Jul ’02, Sep ’02, Dec ’02, Phase 2 Starts, Feb ’03, May ’03, Sep ’03]

OGSA-DAI Summary
- On Schedule & Going Well
  - Contributions via DAIS-WG @ GGF5 & 6
  - Releases with GT3 Releases scheduled
- Status: Early Days
  - Released prototypes
  - Tested Architectural Design Using OGSA
  - Working with Early Adopter Pilot Projects
    - AstroGrid & MyGrid
  - First PRODUCT release Dec ’02
- Influence OGSA-DAI direction
  - Via DAIS-WG & Direct messages to us

Data Processing
[Diagram: Instrument, Raw Data, Multi-stage Processing, Processed Data, Reference Data, In Silico Processing, Archive, Archive]
Characteristics
- Well defined work flow
- Correction, calibration, transformation, filtering, merging
- Relatively static reference data
- Stable processing functions (audited changes)
- Periodic reprocessing from archive
Dave Pearson, Provenance and Derivation workshop, 18 Oct 02, Chicago

Analysis and Interpretation
Archive, Summarisation, Processed Data, Analysis
Characteristics
- Variable workflow
- Standard functions
- Standard and personal filtering and summarisation
- Retain drill down capability
Summarised Data
Dave Pearson, Provenance and Derivation workshop, 18 Oct 02, Chicago

Analysis and Interpretation
[Diagram: Personalised Database, Summarised Data, Processed Data, Result data, Retrieval & Update, Analysis and Interpretation]
Conclusions/Inferences
- Descriptions
- Trends
- Correlations
- Relationships
Characteristics
- Highly dynamic work flow
- Multiple data types
- Volatile data
- Annotations, inferences, conclusions
- Evidential reasoning
- Shared multiple versions of truth
- Periodic version consolidation
Dave Pearson, Provenance and Derivation workshop, 18 Oct 02, Chicago

Metadata Requirements
- Technical Metadata
  - Direct referencing: Physical location and data schema/structure
  - Data currency/status: version, time stamping
  - Accreditation/Access permissions: Ownership (Dublin Core)
  - Query time/Governance: data volume, no. of records, access paths
- Contextual Metadata
  - Logical referencing physical data: semantic/syntactic ontologies
  - Lexical translation: Thesaurus, ontological mapping
  - Named derivations (summarisations)
- Scope of Requirements
  - All science communities
  - Related to provenance
Dave Pearson, Provenance and Derivation workshop, 18 Oct 02, Chicago

Metadata Requirements
- Data Versioning
  - Distinguish latest/agreed version of data
  - Maintain history record of change
  - Synchronise and mirror replicated data
  - Distinguish shared personal interpretations and/or annotations
- Provenance
  - Record of data processing: calibration, filtering, transformation
  - Record of workflow: methods, standards and protocols
  - Reasoning: evidential justification for inferences & conclusions
- Scope of Requirements
  - All science communities
  - Includes Technical and Contextual Metadata
Dave Pearson, Provenance and Derivation workshop, 18 Oct 02, Chicago

Provenance Issues
- Schema evolution
- Granularity of record
- Processed v Derived Inheritance
- Lack of structured annotations,
  ontologies
- Interactive analysis = dynamic workflow
- Multiple derived data sources
- Context of usage
- Best practice can change
- Multiple versions of the truth
- Evidential reasoning
- Existing data & applications
- Where is the provenance record stored
Dave Pearson, Provenance and Derivation workshop, 18 Oct 02, Chicago

Collaborative Annotation
- See DAS
  - Distributed Annotation Service
- Challenges
  - Autonomy
  - Selective viewing
  - Identification
  - Provenance
  - Derivation

Biomedical e-Scientists
- Is this one species?
- Understanding bird energy
- Understanding a river / ocean interaction
- Understanding a biochemical pathway
- Understanding a cell
- Understanding a Heart or Brain
- Understanding Rhododendra
- Understanding Evolution
- …
- No One-Size fits all solutions
- But sharable re-usable components

Opportunities
- Many, many …
  - More than we can address
  - Compute needs
  - Data management needs
  - Data integration needs
  - …
- Must choose some pioneers
  - To meet a range of common requirements
  - To provoke rich & high-level platform
  - To generate re-usable components
- A Long-Term Commitment Needed

Advancing SDMIV Grid
[Layered diagram]
- Scientific Application
- SDMIV (Grid) Application Component Library
- Monitoring, Diagnosis, Scheduling, Accounting, Logging, Authorisation
- Data Integration
- Data Access
- Grid Plumbing & Security Infrastructure
- Data & Compute Resources (Distributed), Structured Data
- Groups: SDMIV Users

Summary
- e-Science
  - Data as well as Compute Challenges
  - Needed to be put together
  - Need ubiquitous supported consistent platforms
- Grid
  - A (potentially) invaluable platform
  - Only show in town
- Data Integration
  - Hard
  - Develop & Use Standard kit of parts
  - Started to build the kit
  - No ready made general integration
  - Combines application & computing science
- Opportunities
  - No one-size fits all, but re-usable subsystems
  - Invest in wider range of Problem driven pioneering
  - Strategic choices needed
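The "standard kit of parts" idea in the Summary can be pictured as small components that share one interface and are chained into an access, translation and selection pipeline. The sketch below is purely illustrative: every name in it (`Record`, `access`, `translate`, `select`, `compose`) is hypothetical, and none of it is the OGSA-DAI API.

```python
from __future__ import annotations

from typing import Callable, Iterable, Iterator

# Hypothetical types: a record is a plain dict, a stage maps records to records.
Record = dict
Stage = Callable[[Iterable[Record]], Iterator[Record]]


def access(rows: list[Record]) -> Iterator[Record]:
    """Stand-in for a data access component: yields rows from a source."""
    yield from rows


def translate(mapping: dict[str, str]) -> Stage:
    """Build a translation component that renames fields according to mapping."""
    def stage(records: Iterable[Record]) -> Iterator[Record]:
        for rec in records:
            yield {mapping.get(key, key): value for key, value in rec.items()}
    return stage


def select(predicate: Callable[[Record], bool]) -> Stage:
    """Build a filtering component that keeps records matching predicate."""
    def stage(records: Iterable[Record]) -> Iterator[Record]:
        return (rec for rec in records if predicate(rec))
    return stage


def compose(*stages: Stage) -> Stage:
    """Chain components left to right into a single reusable stage."""
    def pipeline(records: Iterable[Record]) -> Iterator[Record]:
        for stage in stages:
            records = stage(records)
        return iter(records)
    return pipeline


if __name__ == "__main__":
    source = [{"id": 1, "temp_c": 4.5}, {"id": 2, "temp_c": 21.0}]
    warm = compose(
        translate({"temp_c": "temperature"}),
        select(lambda rec: rec["temperature"] > 10),
    )
    print(list(warm(access(source))))
```

Because each stage has the same shape, a composed pipeline is itself a stage and can be reused inside a larger one, which is the sense in which re-usable subsystems can serve applications that no single one-size-fits-all integration would.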