Managing Scientific Information Making the Internet Work for Big Science

Managing Scientific Information
Making the Internet Work for Big Science
Professor Greg Riccardi
Florida State University
Department of Computer Science,
UK National e-Science Centre
Overview
• Information on the Web: Current Status
• Providing Information on the Internet
• General Conditions of the Web and Internet
• Resources Needed for Big Science
• What is The Grid?
• How the Grid Might Support Databases
• Computer Science Challenges and Research Opportunities
23 Oct, 2002
Is this the Internet?
Edinburgh on a Normal Day
Can We Find Web Information?
• Use Google to find travel times from Edinburgh to Aberdeen
  • Search for “railroad times Aberdeen Edinburgh”
• Why was no usable information returned?
  • Vocabulary problem
    • “railroad” is not a service (in UK)
    • Search for “train times Aberdeen Edinburgh”
  • Even better, use a ticket service
    • GNER.co.uk
    • Thetrainline.com
Finding Information on the Web
• Consider comparative shopping
  • Provide a capability to compare prices
  • Allow people to see pages of prices
• Example of pricewatch.com
  • Prices of memory
  • Where do prices come from?
  • Can you extract the information content from the Web pages?
• How can we make the Web provide information?
  • Can we establish a way to share price info?
  • Will vendors participate?
XML Creates Opportunity
• Possibility: Use XML to represent information
    <item type="PC3500 DDR">
      <vendor name="memorylabs.com"/>
      <manuf name="Samsung"/>
      <size>512</size>
      <price>169.00</price>
    </item>
• Strategy for sharing
  • Industry creates standard XML schema
  • Vendors create files of prices
  • Comparison shopping sites grab files and create presentations of information
  • Purchasing agents
  • How would shopping sites and purchasing agents find the sources of information?
  • Would vendors agree to publish?
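The sharing strategy on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `<prices>` wrapper element, the second vendor entry, and the function names `parse_items` and `cheapest` are hypothetical and do not come from the talk or from any agreed industry schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical price file a vendor might publish. The <prices> wrapper and the
# second <item> are invented for illustration; only the first <item> follows
# the slide's example.
PRICE_FILE = """
<prices>
  <item type="PC3500 DDR">
    <vendor name="memorylabs.com"/>
    <manuf name="Samsung"/>
    <size>512</size>
    <price>169.00</price>
  </item>
  <item type="PC3500 DDR">
    <vendor name="example-vendor.test"/>
    <manuf name="Samsung"/>
    <size>512</size>
    <price>175.50</price>
  </item>
</prices>
"""

def parse_items(xml_text):
    """Return one dict per <item> element in the price file."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.findall("item"):
        items.append({
            "type": item.get("type"),
            "vendor": item.find("vendor").get("name"),
            "manuf": item.find("manuf").get("name"),
            "size": int(item.findtext("size")),
            "price": float(item.findtext("price")),
        })
    return items

def cheapest(items, item_type, size):
    """Return the lowest-priced offer matching a type and size, or None."""
    matches = [i for i in items if i["type"] == item_type and i["size"] == size]
    return min(matches, key=lambda i: i["price"]) if matches else None

offers = parse_items(PRICE_FILE)
best = cheapest(offers, "PC3500 DDR", 512)
print(best["vendor"], best["price"])
```

A comparison shopping site would run something like `parse_items` over many vendor files. The open questions on the slide (how to find the files, whether vendors will publish them) are not addressed by the code.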
Semantic Web: Information on Web
• The Semantic Web
  • Tim Berners-Lee’s idea: definition from http://www.w3.org/2001/sw/
    • The Semantic Web is the abstract representation of data on the World Wide Web, based on the RDF standards and other standards to be defined.
• Resource Description Framework (RDF) is an emerging standard for representing Web resources
  • http://www.w3.org/RDF/
• The Semantic Web requires sites to provide documents marked up to define information content
  • I.e. XML documents
  • With an agreed ontology
Can the Semantic Web Work?
• According to Henry S. Thompson, U. Edinburgh
  • Talk given at Global Grid Forum, July 2002
  • The Semantic Web is based on metadata
  • Metadata describes resources systematically
    • Suppliers can record what a document or resource is for or about
    • Search engines can work with meaningful information
• What would we need to make the Semantic Web work?
  • A standard syntax for metadata
  • One or more standard vocabularies
    • Allow search engines, producers, and consumers to speak the same language
  • Lots of documents and resources with metadata attached
  • Attribution and trust
  • Access and security
Web Services
• Machine-to-machine exchange of information
  • Web servers deliver XML in response to HTTP requests
• Metadata defined with Web Services Description Language (WSDL)
  • Gives structure of information content of a service
• UDDI commercial registry for services
  • Possible use: try to create a contract to process 1000 credit card transactions
  • Look for services, ask for prices, negotiate, etc.
  • Microsoft, IBM, ebXML
Discovering Web Services
• Again according to Henry S. Thompson, U. Edinburgh
  • The Semantic Web cannot work without inference
• The crucial missing step is the inference engine
  • Information sources say what they can provide
  • Information users say what they want
  • The 2 specifications are not obviously related!
• The user must be able to
  • Find the resources
  • Determine their suitability (and cost)
  • Create a request in the proper form
  • Process the returned data
The reality of Web Services
• Quoting Henry S. Thompson, U. Edinburgh
  • Forget the headline stuff
    • Cars negotiating with petrol stations
    • Agents choosing a specialist based on available appointment slots
  • The focus in practice is on exploiting the move to asynchronous distributed applications
    • Within the enterprise, not between enterprises
    • Using pre-negotiated vocabularies, and little or no discovery
  • IT-intensive enterprises see Web Services primarily as a way to reduce their EAI/middleware bills
• Big Science needs more than Web Services
  • Portal technology for presentation of information to people
  • Open Grid Services Architecture (OGSA) to extend Web Services with capabilities needed for scientific collaborations
Portal Technology
• A portal is a Web page that collects and presents information from many sources
  • Tailored for needs of publisher
  • Tailorable for needs of consumer
• JetSpeed Apache Portal Project
  • FSU K12 Education portal
    • http://edtech.oddl.fsu.edu:8080/K12/
  • Indiana Community Grids portal
    • http://ptlportal.communitygrids.iu.edu/portal/
Big Science and Grid Technology
• Big Science
  • Massive data
  • Massive computing
  • Geographic distribution of people
  • Geographic distribution of resources
• Biggest emerging problems
  • How to make people efficient
• Example science fields
  • Economics, earth sciences, astronomy, mechanical engineering, aerospace, bioinformatics, medicine,
• Grid Technology
  • Software support for heterogeneous distributed applications
What Does Big Science Need?
• Massive amounts of computing and storage
  • Distribution of computing and storage facilities
  • Efficient movement of massive data sets
  • Location-independent processing
  • Management of data and computation
• Tools that are easy to use
  • Control over software development
  • Discovery of computing and information resources
  • Collaboration between geographically distributed people
How Big is Big?
• What is happening to data sizes?
  • 1990 at Jefferson Lab, USA: Planning for new facility
    • Estimated 10 megabytes/sec sustained data rate from equipment in 1998
    • 1 terabyte per day
    • 200 terabytes per year
    • Data storage on £5 million tape silo
    • Tape costs of £200,000 per year
• Consumer examples
  • Data from digital cameras
    • 4 megapixel, 3 bytes per pixel
      • 12 megabytes/picture
      • Compressed much less
  • Data from digital video cameras
    • 4 megabytes/sec
  • Kazaa digital DVD video sharing
    • See effect on networking at Florida State University
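The data-size figures on this slide can be checked with a little arithmetic. The sketch below assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes), which the slide does not state. Under that assumption, 10 megabytes/sec is about 0.86 terabytes per day, which the slide rounds to 1, and 200 terabytes corresponds to roughly 230 days of running per year.

```python
# Assumed decimal units; the slide does not say whether MB means 10**6 or 2**20.
MB = 10**6
TB = 10**12
SECONDS_PER_DAY = 24 * 60 * 60

# Jefferson Lab estimate: 10 megabytes/sec sustained.
rate_bytes_per_sec = 10 * MB
tb_per_day = rate_bytes_per_sec * SECONDS_PER_DAY / TB
days_for_200_tb = 200 / tb_per_day

# Digital camera: 4 megapixels at 3 bytes per pixel, uncompressed.
picture_mb = 4 * 10**6 * 3 / MB

print(f"{tb_per_day:.3f} TB/day")
print(f"{days_for_200_tb:.0f} days of running to reach 200 TB")
print(f"{picture_mb:.0f} MB per uncompressed picture")
```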
Peer to Peer Networking (P2P)
• P2P is sharing resources
  • Directories of services
    • Centralized access to directories
    • Directory search engines
  • Distributed interaction between Client and Server
  • Exploit client-server locality (Sun JXTA)
  • Mobile telephones plan to remove towers
• Future DVD Distribution Scheme
  • First purchasers buy and download from central site
  • Subsequent purchasers download P2P
  • Everyone must buy license before using
  • Be the first on your block to buy the new movie
    • Server receives payment for every download
    • Cable broadband has shared local bus structure
Jefferson Lab Hall D in 2007
Atlas Detector at CERN
• Hall D at Jefferson Lab is not very big
Computation and Data Rates for Hall D
• Expected date for full data collection 2007
• Raw data collection and analysis
  • 15,000 events per second, 5 KB/event, 75 MB/sec
  • At 1/3 duty factor, 0.75 PB/year
  • To analyze each event twice, 50 CPUs
  • Analyzed data 1.5 PB/year
• Computational simulations of experiments
  • Estimates based on expected CPU speeds
  • Expect to need 5,000 events/sec
  • Expect 0.1 CPU-sec/event
  • Need 500 CPUs
  • Simulated data 0.75 PB/year
• Total data rate of 3 petabytes per year
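The Hall D estimates follow from simple multiplication, sketched here. Decimal units and a 365-day year are assumptions, not stated on the slide. With them the raw data volume comes out near 0.79 PB/year, which the slide quotes as 0.75.

```python
# Assumptions (not on the slide): decimal units and a 365-day year.
KB = 10**3
MB = 10**6
PB = 10**15
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Raw data collection.
events_per_sec = 15_000
bytes_per_event = 5 * KB
duty_factor = 1 / 3

raw_rate_mb = events_per_sec * bytes_per_event / MB
raw_pb_per_year = events_per_sec * bytes_per_event * SECONDS_PER_YEAR * duty_factor / PB

# Simulation: CPUs needed = events/sec * CPU-sec/event.
sim_events_per_sec = 5_000
cpu_sec_per_event = 0.1
sim_cpus = sim_events_per_sec * cpu_sec_per_event

# Yearly totals as quoted on the slide (raw + analyzed + simulated), in PB.
total_pb_per_year = 0.75 + 1.5 + 0.75

print(f"raw rate: {raw_rate_mb:.0f} MB/sec")
print(f"raw volume: {raw_pb_per_year:.2f} PB/year")
print(f"simulation CPUs: {sim_cpus:.0f}")
print(f"total: {total_pb_per_year} PB/year")
```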
Hall D Computing Tasks
Diagram of computing tasks: Calibrations, Acquisition, Data Archival, Data Mining, Physics Analysis, Monitoring, Slow Controls, First Pass Analysis, Planning, Partial Wave Analysis, Physics Analysis, Simulation, Publication
Hall D Collaboration Map
Meeting Computational Challenges
Moore’s Law: Computer performance increases by a factor of 2 every 18 months.
Gilder’s Law: Network bandwidth triples every 12 months.
Dennis’s Law: Neither Moore’s Law nor Gilder’s Law will solve our computing problems.
Solving the information management problems requires people working on the software and developing a workable computing environment.
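To see what the two growth laws imply when compounded, the sketch below applies the rates as stated on the slide over a five-year span. The five-year horizon and the helper name `growth` are arbitrary choices for illustration.

```python
def growth(factor, period_months, elapsed_months):
    """Compound growth: multiply by `factor` once per `period_months`."""
    return factor ** (elapsed_months / period_months)

ELAPSED = 60  # five years, an arbitrary horizon chosen for illustration

moore = growth(2, 18, ELAPSED)   # performance doubles every 18 months
gilder = growth(3, 12, ELAPSED)  # bandwidth triples every 12 months

print(f"Moore's Law over 5 years:  x{moore:.1f}")
print(f"Gilder's Law over 5 years: x{gilder:.0f}")
```

Under these rates bandwidth grows far faster than processor performance, yet the slide's point (Dennis's Law) is that neither curve removes the need for software and data management work.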
Database Research is not Dead
• Consider relative speeds of devices
• Pentium 120, circa 1996
  • 120 MHz processor
  • 64 MByte memory
  • 10 Mbit/sec Ethernet
  • 33 MHz memory bus
  • 10 GByte disk
• Today’s Pentium 4
  • 2800 MHz processor (x 24)
  • 1 GByte memory (x 16)
  • 100 Mbit/sec Ethernet (x 10)
  • 400 MHz system bus (x 12)
  • 300 GByte disk (x 30)
• The speed and size of data storage is far outstripping the speed of processors
• Hence: Data management is becoming more and more important
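The multipliers quoted on this slide can be recomputed from the two columns of figures. The sketch assumes 1 GByte = 1024 MByte for the memory comparison. The dictionary keys are just labels chosen here.

```python
# Figures copied from the slide. Units per key: MHz, MByte, Mbit/sec, MHz, GByte.
# Assumption: 1 GByte of memory is taken as 1024 MByte.
pentium_120 = {"processor": 120, "memory": 64, "ethernet": 10, "bus": 33, "disk": 10}
pentium_4 = {"processor": 2800, "memory": 1024, "ethernet": 100, "bus": 400, "disk": 300}

# Ratio of the 2002 figure to the 1996 figure for each component.
ratios = {name: pentium_4[name] / pentium_120[name] for name in pentium_120}

for name, ratio in ratios.items():
    print(f"{name:10s} x {ratio:.1f}")
```

The computed values are close to the rounded multipliers on the slide, with disk capacity showing the largest growth.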
What Scientists Need
• Self-Sustaining Infrastructure
  • With regular well defined structure
• Adequately Described Components
  • Function, Behaviour, QoS, …
• Models Supporting Analysis & Reasoning
  • Finding appropriate components
  • Determining how they compose
• Tools for Composition, Diagnosis & Change
• Sustainable Economic Model
• Reason to Trust the System’s Dependability
Grid Technology
• Virtual Organisations
  • Sharing & Collaboration
• Security
  • Single Sign in, delegation
• Resource Management
  • Discovery
  • Process Creation
  • Scheduling
  • Monitoring
• Distribution & fast FTP
  • But Various Protocols
• Portability
  • Ubiquitous APIs & Modules
• Government Agency Buy in
Foster, I., Kesselman, C. and Tuecke, S., The Anatomy of the Grid: Enabling Virtual Organisations, Intl. J. Supercomputer Applications, 15(3), 2001
Grid Computing Advantages
• Provide identical access for all collaborators
• Utilize all intellectual resources
  • JLab, universities, remote sites
  • Scientists, students
• Maximize total funding resources while meeting the total computing need
• Reduce systems’ complexity
  • Partitioning of facility tasks, to manage and focus resources
• Optimize computing resources to solve problems
  • Tier-n or “Grid” Model
• Reduce long-term computational management problems
Foundations for Grid Sites
• Interactive Services
• Computing Services
• Batch Services
• Grid Services
• Data Services
• Information Services
• Needs Very Reliable Hardware & Software at Remote Sites
• Needs Very Reliable, Easy to Install Software at Remote Sites
Open Grid Services Features
• Description
  • WSDL + WSIL
• Discovery
• Invocation
  • SOAP
  • RPC
  • EJB
• Representations
  • XML + Schema
• Life Time Management
  • Factories
  • Transient & Persistent GS
  • GS Handles
  • GS Records
  • Soft State
• Notification
• Authentication
  • Certificates + Delegation
• Change Management
• Platform Independence
• Tools & Platforms
  • Apache Tomcat/Axis
  • Globus
Foster, I., Kesselman, C., Nick, J. and Tuecke, S., The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration
Grid Data Services OGSA-DAIS
• Common access for physicists everywhere
• Utilizing all intellectual resources
  • JLab, universities, remote sites
  • Scientists, students
• Maximize total funding resources while meeting the total computing need
• Reduce systems’ complexity
  • Partitioning of facility tasks, to manage and focus resources
• Optimization of computing resources to solve the problem
  • Tier-n or “Grid” Model
• Reduce long-term computational management problems
Sample Grid Data Service System
• Registry
  • Metadata about Services and Factories
  • Delivers GSH of factory
• Factory
  • Metadata about Services
  • Creates GDS and returns its GSH
• Grid Data Service
  • Provides access to database
Message flow shown in the diagram:
  1. Client to Grid Service Registry: <find factory>; reply: <factory GSH>
  2. Client to Grid Data Service Factory: <create Grid Data Service>; reply: <Grid Data Service GSH>
  3. Client to Grid Data Service: <data request>; reply: <data document>
  4. Grid Data Service accesses the Database
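The registry, factory and Grid Data Service roles can be mimicked with a small in-memory model. This is a toy sketch, not the OGSA-DAI API: every class name, method name, the `gsh://` handle strings and the `HANDLES` resolver are hypothetical, and a real system would exchange these messages over the network.

```python
import itertools

# Hypothetical handle resolver: maps a Grid Service Handle (GSH) string to a
# live object. A real grid would resolve handles over the network.
HANDLES = {}

class GridDataService:
    """Provides access to one database (here a dict of table name -> rows)."""
    def __init__(self, gsh, database):
        self.gsh = gsh
        self._database = database
        HANDLES[gsh] = self

    def request(self, table):
        """Answer a data request with a data document (a list of rows)."""
        return list(self._database.get(table, []))

class GridDataServiceFactory:
    """Creates Grid Data Services and returns their GSHs."""
    def __init__(self, gsh, database):
        self.gsh = gsh
        self._database = database
        self._ids = itertools.count(1)
        HANDLES[gsh] = self

    def create_grid_data_service(self):
        gds = GridDataService(f"{self.gsh}/gds/{next(self._ids)}", self._database)
        return gds.gsh

class Registry:
    """Holds metadata about factories and delivers a factory's GSH."""
    def __init__(self):
        self._entries = {}

    def register(self, description, factory_gsh):
        self._entries[description] = factory_gsh

    def find_factory(self, description):
        return self._entries[description]

# Wire up a toy system with invented data.
database = {"T1": [{"m1": 1, "m2": 2, "m3": 3}]}
factory = GridDataServiceFactory("gsh://example.test/factory", database)
registry = Registry()
registry.register("physics data", factory.gsh)

# Client side: find the factory, create a service, request data.
factory_gsh = registry.find_factory("physics data")
gds_gsh = HANDLES[factory_gsh].create_grid_data_service()
rows = HANDLES[gds_gsh].request("T1")
print(gds_gsh, rows)
```

The client only ever holds handles, never direct references to the factory or service, which is the property the slide's three-step exchange illustrates.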
Requesting Data from a GDS
Diagram: Requester A sends a <query specification> to the GridDataService Port of the Grid Data Service and receives a <data document>.
Example request:
    <GDSS>
      <Body>
        <Statement name="xyz1">
          SELECT m1, m2, m3 FROM T1 WHERE …
        </Statement>
        <Delivery>
          <From> xyz1 </From>
          <To> GSH of C </To>
        </Delivery>
        <Execute>xyz1</Execute>
      </Body>
    </GDSS>
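A request document of this shape can be assembled programmatically. The sketch below uses Python's standard `xml.etree.ElementTree` and copies the element names from the slide. The function name `build_request` is hypothetical, and the SQL omits the WHERE clause because the slide elides it.

```python
import xml.etree.ElementTree as ET

def build_request(name, sql, deliver_to):
    """Build a GDSS request: one named statement, its delivery, and an execute."""
    root = ET.Element("GDSS")
    body = ET.SubElement(root, "Body")
    statement = ET.SubElement(body, "Statement", {"name": name})
    statement.text = sql
    delivery = ET.SubElement(body, "Delivery")
    ET.SubElement(delivery, "From").text = name
    ET.SubElement(delivery, "To").text = deliver_to
    ET.SubElement(body, "Execute").text = name
    return root

# The WHERE clause is left out because the slide does not give it.
doc = build_request("xyz1", "SELECT m1, m2, m3 FROM T1", "GSH of C")
xml_text = ET.tostring(doc, encoding="unicode")

# Round-trip through the parser to confirm the document is well-formed.
parsed = ET.fromstring(xml_text)
print(xml_text)
```

Building the document through an XML library rather than string concatenation guarantees matching open and close tags.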
Request with Separate Delivery
• A makes request
• B and C receive resulting data
Message flow shown in the diagram:
  • Requester A to Grid Data Service: <query specification>; reply: <query response> with query identifier
  • Requester B to GridDataService Port: <transport specification>; receives <data document>
  • Requester C to GridDataService Port: <transport specification>; receives <data document>
CS Challenges and Research
• Semantic Web
  • Ontologies
  • Inference engines
  • XML databases
  • Efficient access to XML resources
• Grid
  • Architecture: OGSI
  • Transport
  • Repositories and replication
  • Security and authorization
• OGSA-DAI
  • Registries and discovery
  • Integration of DB and scripting languages
  • XML database update and query
  • Distributed query processing
Web References
• Web search
  • http://www.pricewatch.com
  • http://www.google.com
  • http://www.mysimon.com
• Web Services
  • http://www.w3.org/2001/sw/ and RDF/
  • ebXML: http://www.ebxml.org/
  • Microsoft UDDI repository: http://uddi.microsoft.com
  • IBM UDDI: http://www7b.boulder.ibm.com/wsdd/downloads/UDDIregistry.html
• Peer to Peer
  • Sun JXTA: http://www.jxta.org/
  • Kazaa: http://www.kazaa.com
• National e-Science Centre
  • Home: http://www.nesc.ed.uk/
  • Talks: http://umbriel.dcs.gla.ac.uk/NeSC/general/presentations/
• Grid and Grid Computing
  • Grid Forum: http://www.gridforum.org
  • OGSA: http://www.gridforum.org/ogsi-wg/
  • OGSA-DAIS: http://www.gridforum.org/ogsi-wg/