Managing Scientific Information
Making the Internet Work for Big Science
Professor Greg Riccardi
Florida State University Department of Computer Science, UK National e-Science Centre
23 Oct, 2002

Overview
  Information on the Web: Current Status
  Providing Information on the Internet
  General Conditions of the Web and Internet
  Resources Needed for Big Science
  What is The Grid?
  How the Grid Might Support Databases
  Computer Science Challenges and Research Opportunities

Is this the Internet?

Edinburgh on a Normal Day

Can We Find Web Information?
  Use Google to find travel times from Edinburgh to Aberdeen
    Search for “railroad times Aberdeen Edinburgh”
  Why was no usable information returned?
    Vocabulary problem
    “railroad” is not a service (in UK)
    Search for “train times Aberdeen Edinburgh”
  Even better, use a ticket service
    GNER.co.uk
    Thetrainline.com

Finding Information on the Web
  Consider comparative shopping
    Example of pricewatch.com
    Provide a capability to compare prices
    Allow people to see pages of prices
    Prices of memory
  Where do prices come from?
    Can you extract the information content from the Web pages?
  How can we make the Web provide information?
    Can we establish a way to share price info?
    Will vendors participate?

XML Creates Opportunity
  Possibility: Use XML to represent information
    <item type="PC3500 DDR">
      <vendor name="memorylabs.com"/>
      <manuf name="Samsung"/>
      <size>512</size>
      <price>169.00</price>
    </item>
  Strategy for sharing
    Industry creates standard XML schema
    Vendors create files of prices
    Comparison shopping sites grab files and create presentations of information
    Purchasing agents
  How would shopping sites and purchasing agents find the sources of information?
  Would vendors agree to publish?
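As a rough illustration of the sharing strategy on the XML slide, the sketch below shows how a comparison shopping site might read vendor price files and pick the cheapest offer. It is only a sketch under stated assumptions: the <prices> wrapper element, the second vendor and its price, and the function names parse_feed and cheapest are all made up for this example. Only the <item> layout comes from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical vendor price files. The <item> layout follows the slide;
# the <prices> wrapper and the second vendor are assumptions for illustration.
FEEDS = [
    """<prices>
         <item type="PC3500 DDR">
           <vendor name="memorylabs.com"/>
           <manuf name="Samsung"/>
           <size>512</size>
           <price>169.00</price>
         </item>
       </prices>""",
    """<prices>
         <item type="PC3500 DDR">
           <vendor name="example-vendor.test"/>
           <manuf name="Samsung"/>
           <size>512</size>
           <price>175.50</price>
         </item>
       </prices>""",
]


def parse_feed(xml_text):
    """Return one dict per <item> element found in a vendor price file."""
    root = ET.fromstring(xml_text)
    offers = []
    for item in root.iter("item"):
        offers.append({
            "type": item.get("type"),
            "vendor": item.find("vendor").get("name"),
            "manuf": item.find("manuf").get("name"),
            "size": int(item.findtext("size")),
            "price": float(item.findtext("price")),
        })
    return offers


def cheapest(feeds, item_type, size):
    """Return the lowest-priced matching offer across all feeds, or None."""
    offers = [
        offer
        for feed in feeds
        for offer in parse_feed(feed)
        if offer["type"] == item_type and offer["size"] == size
    ]
    return min(offers, key=lambda offer: offer["price"]) if offers else None


best = cheapest(FEEDS, "PC3500 DDR", 512)
print(best["vendor"], best["price"])
```

The code only works because every vendor uses the same element names, which is the point of the industry-standard schema on the slide. It does not address the two open questions: how the shopping site finds the files, and whether vendors would publish them.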

Semantic Web: Information on Web
  The Semantic Web
    Tim Berners-Lee’s idea: definition from http://www.w3.org/2001/sw/
    The Semantic Web is the abstract representation of data on the World Wide Web, based on the RDF standards and other standards to be defined.
  Resource Description Framework (RDF) is an emerging standard for representing Web resources
    http://www.w3.org/RDF/
  The Semantic Web requires sites to provide documents marked up to define information content
    I.e. XML documents
    With an agreed ontology

Can the Semantic Web Work?
  According to Henry S. Thompson, U. Edinburgh
    Talk given at Global Grid Forum July 2002
  The Semantic Web is based on metadata
    Metadata describes resources systematically
    Suppliers can record what a document or resource is for or about
    Search engines can work with meaningful information
  What would we need to make Semantic Web work?
    A standard syntax for metadata
    One or more standard vocabularies
      Allow search engines, producers, and consumers to speak the same language
    Lots of documents and resources with metadata attached
    Attribution and trust
    Access and security

Web Services
  Machine-to-machine exchange of information
    Web servers deliver XML in response to HTTP requests
  Metadata defined with Web Services Description Language (WSDL)
    Gives structure of information content of a service
  UDDI commercial registry for services
    Microsoft, IBM, ebXML
  Possible use: try to create a contract to process 1000 credit card transactions
    Look for services, ask for prices, negotiate, etc.

Discovering Web Services
  Again according to Henry S. Thompson, U. Edinburgh
    The crucial missing step is the inference engine
    The Semantic Web cannot work without inference
  Information sources say what they can provide
  Information users say what they want
  The 2 specifications are not obviously related!
  The user must be able to
    Find the resources
    Determine their suitability (and cost)
    Create a request in the proper form
    Process the returned data

The reality of Web Services
  Quoting Henry S. Thompson, U. Edinburgh
    Forget the headline stuff
      Cars negotiating with petrol stations
      Agents choosing a specialist based on available appointment slots
    The focus in practice is on exploiting the move to asynchronous distributed applications
      Within the enterprise, not between enterprises
      Using pre-negotiated vocabularies, and little or no discovery
    IT-intensive enterprises see Web Services primarily as a way to reduce their EAI/middleware bills
  Big Science needs more than Web Services
    Portal technology for presentation of information to people
    Open Grid Services Architecture (OGSA) to extend Web Services with capabilities needed for scientific collaborations

Portal Technology
  A portal is a Web page that collects and presents information from many sources
    Tailored for needs of publisher
    Tailorable for needs of consumer
  JetSpeed Apache Portal Project
    FSU K12 Education portal http://edtech.oddl.fsu.edu:8080/K12/
    Indiana Community Grids portal http://ptlportal.communitygrids.iu.edu/portal/

Big Science and Grid Technology
  Big Science
    Biggest emerging problems
      Massive data
      Massive computing
      Geographic distribution of people
      Geographic distribution of resources
      How to make people efficient
    Example science fields
      Economics, earth sciences, astronomy, mechanical engineering, aerospace, bioinformatics, medicine
  Grid Technology
    Software support for heterogeneous distributed applications

What Does Big Science Need?
  Massive amounts of computing and storage
  Tools that are easy to use
  Distribution of computing and storage facilities
  Efficient movement of massive data sets
  Location-independent processing
  Management of data and computation
  Control over software development
  Discovery of computing and information resources
  Collaboration between geographically distributed people

How Big is Big?
  What is happening to data sizes?
  1990 at Jefferson Lab, USA: Planning for new facility
    Estimated 10 megabytes/sec sustained data rate from equipment in 1998
    1 terabyte per day
    200 terabytes per year
    Data storage on £5 million tape silo
    Tape costs of £200,000 per year
  Consumer examples
    Data from digital cameras
      4 megapixel, 3 bytes per pixel
      12 megabytes/picture
      Compressed much less
    Data from digital video cameras
      4 megabytes/sec
    Kazaa digital DVD video sharing
      See effect on networking at Florida State University

Peer to Peer Networking (P2P)
  P2P is sharing resources
    Directories of services
    Centralized access to directories
    Directory search engines
    Distributed interaction between Client and Server
    Exploit client-server locality (Sun JXTA)
    Mobile telephones plan to remove towers
    Cable broadband has shared local bus structure
  Future DVD Distribution Scheme
    First purchasers buy and download from central site
    Subsequent purchasers download P2P
    Everyone must buy license before using
    Server receives payment for every download
    Be the first on your block to buy the new movie

Jefferson Lab Hall D in 2007

Atlas Detector at CERN
  Hall D at Jefferson Lab is not very big

Computation and Data Rates for Hall D
  Expected date for full data collection: 2007
  Raw data collection and analysis
    15,000 events per second, 5 KB/event, 75 MB/sec
    At 1/3 duty factor, 0.75 PB/year
    To analyze each event twice, 50 CPUs
    Analyzed data 1.5 PB/year
  Computational simulations of experiments
    Estimates based on expected CPU speeds
    Expect to need 5,000 events/sec
    Expect 0.1 CPU-sec/event
    Need 500 CPUs
    Simulated data 0.75 PB/year
  Total data rate of 3 petabytes per year

Hall D Computing Tasks
  Diagram labels: Calibrations, Acquisition, Data Archival, Data Mining, Physics Analysis, Monitoring, Slow Controls, First Pass Analysis, Planning, Partial Wave Analysis, Physics Analysis, Simulation, Publication

Hall D Collaboration Map

Meeting Computational Challenges
  Moore’s Law: Computer performance increases by a factor of 2 every 18 months.
  Gilder’s Law: Network bandwidth triples every 12 months.
  Dennis’s Law: Neither Moore’s Law nor Gilder’s Law will solve our computing problems.
  Solving the information management problems requires people working on the software and developing a workable computing environment.

Database Research is not Dead
  Consider relative speeds of devices

    Component    Pentium 120, circa 1996    Today’s Pentium 4     Factor
    Processor    120 MHz                    2800 MHz              x 24
    Memory       64 MByte                   1 GByte               x 16
    Ethernet     10 Mbit/sec                100 Mbit/sec          x 10
    Bus          33 MHz memory bus          400 MHz system bus    x 12
    Disk         10 GByte                   300 GByte             x 30

  The speed and size of data storage are far outstripping the speed of processors
  Hence: Data management is becoming more and more important

What Scientists Need
  Self-Sustaining Infrastructure
  Adequately Described Components
    Function, Behaviour, QoS, …
  Models Supporting Analysis & Reasoning
    With regular well defined structure
    Finding appropriate components
    Determining how they compose
  Tools for Composition, Diagnosis & Change
  Sustainable Economic Model
  Reason to Trust the System’s Dependability

Grid Technology
  Virtual Organisations
  Security
  Discovery
  Process Creation
  Scheduling
  Monitoring
  Portability
  But
  Various Protocols
  Resource Management
  Single Sign in, delegation
  Distribution & fast FTP
  Sharing & Collaboration
  Ubiquitous APIs & Modules
  Government Agency Buy in
  Foster, I., Kesselman, C. and Tuecke, S., The Anatomy of the Grid: Enabling Virtual Organisations, Intl. J. Supercomputer Applications, 15(3), 2001

Grid Computing Advantages
  Provide identical access for all collaborators
  Utilize all intellectual resources
    JLab, universities, remote sites
    Scientists, students
  Maximize total funding resources while meeting the total computing need
  Reduce systems’ complexity
    Partitioning of facility tasks, to manage and focus resources
  Optimize computing resources to solve problems
    Tier-n or “Grid” Model
  Reduce long-term computational management problems

Foundations for Grid Sites
  Interactive Services
  Computing Services
  Batch Services
  Grid Services
  Data Services
  Information Services
  Needs Very Reliable Hardware & Software at Remote Sites
  Needs Very Reliable, Easy to Install Software at Remote Sites

Open Grid Services Features
  Diagram labels: WSDL + WSIL, Apache Tomcat/Axis, Globus, SOAP RPC, EJB, Representations, XML + Schema, Life Time Management, Invocation, Description, Discovery, Tools & Platforms, Authentication, Factories, Transient & Persistent GS, GS Handles, GS Records, Soft State, Notification, Certificates + Delegation, Change Management, Platform Independence
  Foster, I., Kesselman, C., Nick, J. and Tuecke, S., The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration

Grid Data Services OGSA-DAIS
  Common access for physicists everywhere
  Utilize all intellectual resources
    JLab, universities, remote sites
    Scientists, students
  Maximize total funding resources while meeting the total computing need
  Reduce systems’ complexity
    Partitioning of facility tasks, to manage and focus resources
  Optimize computing resources to solve the problem
    Tier-n or “Grid” Model
  Reduce long-term computational management problems

Sample Grid Data Service System
  Components
    Grid Service Registry (Registry)
      Metadata about Services and Factories
      Delivers GSH of factory
    Grid Data Service Factory (Factory)
      Metadata about Services
      Creates GDS and returns its GSH
    Grid Data Service
      Provides access to database
    Client
    Database
  Messages
    Client sends <find factory> to the Registry and receives <factory GSH>
    Client sends <create Grid Data Service> to the Factory and receives <Grid Data Service GSH>
    Client sends <data request> to the Grid Data Service and receives <data document>

Requesting Data from a GDS
  Requester A sends <query specification> to the GridDataService Port of the Grid Data Service and receives <data document>
    <GDSS>
      <Body>
        <Statement name="xyz1">
          SELECT m1, m2, m3 FROM T1 WHERE …
        </Statement>
        <Delivery>
          <From> xyz1 </From>
          <To> GSH of C </To>
        </Delivery>
        <Execute>xyz1</Execute>
      </Body>
    </GDSS>

Request with Separate Delivery
  A makes request
  B and C receive resulting data
  Requester A sends <query specification> to the GridDataService Port of the Grid Data Service and receives <query response> with a query identifier
  Requester B: <transport specification>, <data document>
  Requester C: <transport specification>, <data document>

CS Challenges and Research
  Semantic Web
    Ontologies
    Inference engines
    XML databases
    Efficient access to XML resources
  Grid
    Architecture: OGSI
    Transport
    Repositories and replication
    Security and authorization
    OGSA-DAI
  Registries and discovery
  Integration of DB and scripting languages
  XML database update and query
  Distributed query processing

Web References
  Web search
    http://www.pricewatch.com
    http://www.google.com
    http://www.mysimon.com
  Web Services
    http://www.w3.org/2001/sw/ and http://www.w3.org/RDF/
    ebXML: http://www.ebxml.org/
    Microsoft UDDI repository: http://uddi.microsoft.com
    IBM UDDI: http://www7b.boulder.ibm.com/wsdd/downloads/UDDIregistry.html
  Peer to Peer
    Sun JXTA: http://www.jxta.org/
    Kazaa: http://www.kazaa.com
  National e-Science Centre
    Home: http://www.nesc.ed.uk/
    Talks: http://umbriel.dcs.gla.ac.uk/NeSC/general/presentations/
  Grid and Grid Computing
    Grid Forum: http://www.gridforum.org
    OGSA: http://www.gridforum.org/ogsi-wg/
    OGSA-DAIS: http://www.gridforum.org/ogsi-wg/