Data Challenges in e-Science
Aberdeen
Prof. Malcolm Atkinson
Director
www.nesc.ac.uk
2nd December 2003
What is e-Science?
Foundation for e-Science
e-Science methodologies will rapidly transform science, engineering, medicine and business, driven by exponential growth (×1000/decade)
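A factor of 1000 per decade is the same as doubling roughly every year; a one-line check of the arithmetic:

# ×1000 per decade  =>  doubling time = 10 years × ln(2) / ln(1000) ≈ 1 year
import math
print(f"doubling every {10 * math.log(2) / math.log(1000):.2f} years")   # -> 1.00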

enabling a whole-system approach: the Grid links computers, software, instruments, sensor nets, colleagues and shared data archives
(Diagram derived from Ian Foster’s slide)
Three-way Alliance
Multi-national, Multi-discipline, Computer-enabled
  Consortia, Cultures & Societies
  Theory, Models & Simulations → Shared Data
  Experiment & Advanced Data Collection → Shared Data
  Requires much Computing Science: engineering, systems, notations & formal foundation; much innovation
  Changes culture: new mores, new behaviours → process & trust
  New Opportunities, New Results, New Rewards
Biochemical Pathway Simulator
(Computing Science, Bioinformatics, Beatson Cancer Research Labs)
Closing the information loop – between lab and computational model.
DTI Bioscience Beacon Project
Harnessing Genomics Programme
Slide from Professor Muffy Calder, Glasgow
Why is Data Important?
Data as Evidence – all disciplines
  Hypothesis, curiosity or … → Collections of Data → Analysis & Models → Derived & Synthesised Data → Information, knowledge, decisions & designs
  Driven by creativity, imagination and perspiration
Data as Evidence – Historically
  Individual’s idea, hypothesis or curiosity → personal collection of data (the lab notebook) → personal effort in analysis & models → derived & synthesised data → information, knowledge, decisions & designs
  Driven by creativity, imagination, perspiration & personal resources
Data as Evidence – Enterprise Scale
  Agreed hypotheses, curiosity or business goals → enterprise databases & (archive) file stores and data production pipelines → collections of digital data → analysis & computational models → derived & synthesised data, data products & results → information, knowledge, decisions & designs
  Driven by creativity, imagination, perspiration & the company’s resources
Data as Evidence – e-Science
  Communities and challenges, shared goals and multiple hypotheses → multi-enterprise & public curation of collections of published & private data → synthesis from multiple sources; multi-enterprise models, computation & workflow; analysis, computation & annotation → derived & synthesised data and shared data products & results → information, knowledge, decisions & designs
  Driven by creativity, imagination, perspiration & shared resources
Global in-flight engine diagnostics
  100,000 aircraft, 4 flights/day, 0.5 GB of in-flight data per flight → ~200 TB/day
  In-flight data → global network (e.g. SITA) → airline ground station → DS&S Engine Health Center → data centre & maintenance centre (internet, e-mail, pager)
  Distributed Aircraft Maintenance Environment (DAME): Universities of Leeds, Oxford, Sheffield & York
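A back-of-envelope check of the slide's figures, taking the quoted fleet size, flights per day and data per flight at face value:

# Daily data volume implied by the DAME figures above.
aircraft = 100_000           # fleet size
flights_per_day = 4          # flights per aircraft per day
gb_per_flight = 0.5          # in-flight data per flight (GB)
daily_tb = aircraft * flights_per_day * gb_per_flight / 1_000
print(f"{daily_tb:.0f} TB/day")   # -> 200 TB/day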
LHC Distributed Simulation & Analysis
  One bunch crossing every 25 ns; ~100 triggers per second; each event is ~1 MByte
  Detector (~PBytes/sec) → Online System → ~100 MBytes/sec → Tier 0: CERN Computer Centre (>20 TIPS) and Offline Farm (~20 TIPS)
  Tier 1 (~Gbits/sec or air freight): regional centres in France, Italy, the UK (RAL) and the US; ScotGRID++ ~1 TIPS
  Tier 2 (~Gbits/sec): Tier-2 centres, ~1 TIPS each, with physics data caches
  Tier 3: institute servers, ~0.25 TIPS; Tier 4: workstations, 100-1000 Mbits/sec
  Physicists work on analysis “channels”
  Each institute has ~10 physicists working on one or more channels
  Data for these channels should be cached by the institute server
  1 TIPS = 25,000 SpecInt95; a PC (1999) = ~15 SpecInt95
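The post-trigger rate follows from the event figures above; a rough check, assuming ~100 triggers/s and ~1 MB per event as quoted:

# Data rate leaving the online system, from the slide's trigger figures.
triggers_per_second = 100        # events passing the trigger each second
event_megabytes = 1.0            # ~1 MB per event
rate_mb_s = triggers_per_second * event_megabytes
print(f"{rate_mb_s:.0f} MB/s")                      # -> 100 MB/s into Tier 0
print(f"{rate_mb_s * 86_400 / 1e6:.1f} TB/day")     # -> ~8.6 TB/day if sustained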
DataGrid Testbed
  >40 testbed sites (HEP and ESA), including: CERN, RAL, Estec, KNMI, IPSL, Paris, Lyon, Grenoble, Marseille, Santander, Madrid, Barcelona, Valencia, Lisboa, Berlin, Prague, Brno, Milano, Torino, PD-LNL, Pisa, BO-CNAF, Roma, ESRIN, Catania, Dubna, Moscow, Lund
  Contacts: Francois.Etienne@in2p3.fr, Antonia.Ghiselli@cnaf.infn.it
Multiple overlapping communities
  Each community has its own shared goals and multiple hypotheses, its own collections of published & private data, its own analysis, computation and annotation, and its own derived & synthesised data
  The communities and their data overlap
  Supported by common standards & shared infrastructure
Life-science Examples
Database Growth
  45,356,382,990 bases; PDB content also growing rapidly
Wellcome Trust: Cardiovascular Functional Genomics (BRIDGES project, with IBM)
  Shared data across Glasgow, Edinburgh, Leicester, Oxford, London and the Netherlands, alongside public curated data
  Depends on building & maintaining security, privacy & trust
Comparative Functional Genomics
  Large amounts of data
  Highly heterogeneous: data types, data forms, community
  Highly complex and inter-related
  Volatile
  myGrid project: Carole Goble, University of Manchester
  UCSF, UIUC; from Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign
Home Computers Evaluate AIDS Drugs
  Community = 1000s of home-computer users
  Philanthropic computing vendor (Entropia)
  Research group (Scripps)
  Common goal = advance AIDS research
  From Steve Tuecke, 12 Oct. 01
Astronomy Examples
Global Knowledge Communities driven by Data: e.g., Astronomy
  Number & sizes of data sets as of mid-2002, grouped by wavelength
  12-waveband coverage of large areas of the sky
  Total about 200 TB of data, doubling every 12 months
  Largest catalogues near 1 billion objects
  Data and images courtesy Alex Szalay, Johns Hopkins
Sloan Digital Sky Survey Production System (slide from Ian Foster’s SSDBM 03 keynote)
Supernova Cosmology Requires Complex, Widely Distributed Workflow Management
Engineering Examples
Whole-system simulations
  Wing models: lift capabilities, drag capabilities, responsiveness
  Stabilizer models: deflection capabilities, responsiveness
  Airframe models
  Engine models: thrust performance, reverse thrust performance, responsiveness, fuel consumption
  Landing gear models: braking performance, steering capabilities, traction, dampening capabilities
  Human models – crew capabilities: accuracy, perception, stamina, reaction times, SOPs
NASA Information Power Grid: coupling all sub-system simulations – slide from Bill Johnson
Mathematicians Solve NUG30
  Looking for the solution to the NUG30 quadratic assignment problem
  An informal collaboration of mathematicians and computer scientists
  Condor-G delivered 3.46E8 CPU-seconds in 7 days (peak 1009 processors) across 8 sites in the U.S. and Italy
  Solution: 14,5,28,24,1,3,16,15,10,9,21,2,4,29,25,22,13,26,17,30,6,20,19,8,18,7,27,12,11,23
  MetaNEOS: Argonne, Iowa, Northwestern, Wisconsin
  From Miron Livny, 7 Aug. 01
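For scale, the CPU-seconds quoted above imply roughly how many processors were kept busy on average over the week (a simple division of the slide's own figures):

# Average parallelism implied by the Condor-G figures above.
cpu_seconds = 3.46e8
wall_seconds = 7 * 86_400                 # 7 days of wall-clock time
print(f"~{cpu_seconds / wall_seconds:.0f} processors busy on average (peak was 1009)")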
Network for Earthquake Engineering Simulation
  NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers, & each other
  On-demand access to experiments, data streams, computing, archives, collaboration
  NEESgrid: Argonne, Michigan, NCSA, UIUC, USC
  From Steve Tuecke, 12 Oct. 01
National Airspace Simulation Environment
  22,000 commercial US flights a day drive the simulation
  44,000 wing runs (wing models), 66,000 stabilizer runs (stabilizer models), 50,000 engine runs (engine models, GRC), 22,000 airframe impact runs (airframe models), 48,000 human crew runs (human models), 132,000 landing/take-off gear runs (landing gear models)
  Simulation drivers: FAA ops data, weather data, airline schedule data, digital flight data, radar tracks, terrain data, surface data
  NASA centres ARC, GRC and LaRC feed the Virtual National Air Space (VNAS)
  NASA Information Power Grid: aircraft, flight paths, airport operations and the environment are combined to get a virtual national airspace
Data Challenges
It’s Easy to Forget How Different 2003 is From 1993
  Enormous quantities of data: Petabytes
    For an increasing number of communities the gating step is not collection but analysis
  Ubiquitous Internet: >100 million hosts
    Collaboration & resource sharing the norm
    Security and trust are crucial issues
  Ultra-high-speed networks: >10 Gb/s
    Global optical networks
    Bottlenecks: the last kilometre & firewalls
  Huge quantities of computing: >100 Teraop/s
    Moore’s law gives us all supercomputers
    Organising their effective use is the challenge
  Moore’s law everywhere
    Instruments, detectors, sensors, scanners, …
    Organising their effective use is the challenge
  Derived from Ian Foster’s slide at SSDBM, July 03
Tera → Peta Bytes (May 2003, approximately correct)

                        1 TB                         1 PB
  RAM time to move      15 minutes                   2 months
  1 Gb WAN move time    10 hours ($1,000)            14 months ($1 million)
  Disk cost             7 disks = $5,000 (SCSI)      6,800 disks + 490 units + 32 racks = $7 million
  Disk power            100 Watts                    100 Kilowatts
  Disk weight           5.6 kg                       33 tonnes
  Disk footprint        inside the machine           60 m²

See also: Distributed Computing Economics, Jim Gray, Microsoft Research, MSR-TR-2003-24
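The WAN row is plain bandwidth arithmetic; a sketch assuming an idealised, fully utilised 1 Gb/s link (real transfers run well below line rate, which is why the slide's figures are larger):

# Transfer times over an ideal, fully utilised 1 Gb/s WAN link.
link_bits_per_second = 1e9
for label, nbytes in [("1 TB", 1e12), ("1 PB", 1e15)]:
    seconds = nbytes * 8 / link_bits_per_second
    print(f"{label}: {seconds / 3600:,.1f} hours ({seconds / 86_400:,.1f} days)")
# -> 1 TB: 2.2 hours; 1 PB: ~2,222 hours (~93 days) even at full line rate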
The Story so Far
  Technology enables Grids, more data & …
  Information Grids will dominate
  Collaboration is essential
    Combining approaches
    Combining skills
    Sharing resources
  (Structured) data is the language of collaboration
  Data access & integration is a ubiquitous requirement
  Many hard technical challenges
    Scale, heterogeneity, distribution, dynamic variation
    Intimate combinations of data and computation
    With unpredictable (autonomous) development of both
Scientific Data
Opportunities
  Global production of published data: volume & diversity
  Combination → analysis → discovery
  Specialised indexing
  New data organisation
  New algorithms
  Varied replication
  Shared annotation
  Intensive data & computation
Challenges
  Data huggers
  Meagre metadata
  Ease of use
  Optimised integration
  Dependability
  Fundamental principles
  Approximate matching
  Multi-scale optimisation
  Autonomous change
  Legacy structures
  Scale and longevity
  Privacy and mobility
UK e-Science
e-Science and the Grid
  ‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’
  ‘e-Science will change the dynamic of the way science is undertaken.’
  John Taylor, Director General of Research Councils, Office of Science and Technology
  From a presentation by Tony Hey
e-Science Programme’s Vision
  The UK will lead in the exploitation of e-Infrastructure
  New, faster and better research
    Engineering design, medical diagnosis, decision support, …
  e-Business, e-Research, e-Design & e-Decision
  Depends on leading e-Infrastructure development & deployment
e-Science and SR2002

  Research Council         2001-04          2004-06
  Medical                  (£8M)            £13.1M
  Biological               (£8M)            £10.0M
  Environmental            (£7M)            £8.0M
  Eng & Phys               (£17M)           £18.0M
  HPC                      (£9M)            £2.5M
  Core Prog.               (£15M) + £20M    £16.2M
  Particle Phys & Astro    (£26M)           £31.6M
  Economic & Social        (£3M)            £10.6M
  Central Labs             (£5M)            £5.0M
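Summing the allocations in the table gives the overall programme scale; these totals are just the column sums of the figures above:

# Column sums of the SR2002 table (£M).
phase_2001_04 = [8, 8, 7, 17, 9, 15 + 20, 26, 3, 5]
phase_2004_06 = [13.1, 10.0, 8.0, 18.0, 2.5, 16.2, 31.6, 10.6, 5.0]
print(f"2001-04: £{sum(phase_2001_04)}M, 2004-06: £{sum(phase_2004_06):.1f}M")
# -> 2001-04: £118M, 2004-06: £115.0M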
National e-Science Centre
  National e-Science Institute (NeSI) in Edinburgh – www.nesc.ac.uk
  Training team
  International relationships: Globus Alliance, EGEE, HPC(x)
  Engineering Task Force, Architecture Task Force, Grid Support Centre
  OGSA-DAI – one of 11 Centre projects
  GridNet to support standards work – one of 6 administration projects
  5 application projects; 15 “Fundamental” research projects
NeSI Events held in the 2nd Year (from 1 Aug 2002 to 31 Jul 2003)
We have had 86 events (Year 1 figures in brackets):
  11 project meetings               (4)
  11 research meetings              (7)
  25 workshops                      (17 + 1)
  2 “summer” schools                (0)
  15 training sessions              (8)
  12 outreach events                (3)
  5 international meetings          (1)
  5 e-Science management meetings   (7)
  (though the definitions are fuzzy!)
  > 3600 participant days
Establishing a training team
Investing in community building, skill generation & knowledge development
Suggestions always welcome
NeSI Workshops
Space for real work
http://www.nesc.ac.uk/events/
Crossing communities
Creativity: new strategies and solutions
Written reports
Scientific Data Mining, Integration and Visualisation
Grid Information Systems
Portals and Portlets
Virtual Observatory as a Data Grid
Imaging, Medical Analysis and Grid Environments
Grid Scheduling
Provenance & Workflow
GeoSciences & Scottish Bioinformatics Forum
E-Infrastructure
Infrastructure Architecture – a Virtual Integration Architecture
  Data-intensive users
    Data-intensive applications for application area X
    Simulation, analysis & integration technology for application area X
  Generic virtual data access and integration layer
  OGSA services: job submission, brokering, registry, banking, data transport, workflow, structured data integration, authorisation, resource usage, transformation, structured data access
  OGSI: interface to Grid infrastructure
  Compute, data & storage resources
  Structured data (distributed): relational, XML, semi-structured
Data Access & Integration Services
  1a. Client sends a request to the Registry (SOAP/HTTP) for sources of data about “x”
  1b. Registry responds with a Factory handle
  2a. Client requests access to the database from the Factory (service creation, API interactions)
  2b. Factory creates a GridDataService (GDS) to manage access
  2c. Factory returns the handle of the GDS to the client
  3a. Client queries the GDS with XPath, SQL, etc.
  3b. GDS interacts with the XML or relational database
  3c. Results of the query are returned to the client as XML
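The numbered steps boil down to: look up a Factory in the Registry, ask the Factory for a Grid Data Service, then query that service. A minimal client-side sketch of that flow follows; the class names and endpoints are illustrative assumptions, not the actual OGSA-DAI API.

# Illustrative sketch only: these classes mimic the numbered steps above;
# they are NOT the real OGSA-DAI client API.

class GridDataService:
    def __init__(self, handle):
        self.handle = handle
    def perform(self, query):
        # 3a-3c: the query (XPath, SQL, ...) goes to the GDS, which talks to the
        # underlying XML/relational database and returns the results as XML
        return f"<results query={query!r}/>"          # placeholder result document

class Factory:
    def __init__(self, handle):
        self.handle = handle
    def create_gds(self, database):
        # 2a-2c: the Factory creates a GridDataService and returns its handle
        return GridDataService(f"{self.handle}/gds/{database}")

class Registry:
    def __init__(self, url):
        self.url = url                                # SOAP/HTTP endpoint (assumed)
    def find_factory(self, topic):
        # 1a-1b: ask for sources of data about `topic`, receive a Factory handle
        return Factory(f"{self.url}/factories/{topic}")

registry = Registry("http://example.org/registry")    # hypothetical endpoint
factory = registry.find_factory("x")
gds = factory.create_gds("proteins")                  # hypothetical database name
print(gds.perform("SELECT * FROM annotations WHERE term = 'x'"))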
Future DAI Services?
  1a. Client sends a request to the Data Registry (SOAP/HTTP) for sources of data about “x” & “y”
  1b. Registry responds with a Factory handle
  2a. Client requests access and integration from resources Sx and Sy via the Data Access & Integration master Factory
  2b. Factory creates a network of GridDataServices (GDS1, GDS2, GDS3) and GridDataTranslationServices (GDTS1, GDTS2) over Sx (an XML database) and Sy (a relational database), guided by semantic metadata
  2c. Factory returns the handle of the GDS to the client
  3a. Client submits a sequence of scripts, each with a set of queries to the GDS (XPath, SQL, etc.)
  3b. Client tells the “scientific” analyst, who works in a problem-solving environment / client application, coding scientific insights
  3c. Sequences of result sets are returned to the analyst’s application code as formatted binary described in a standard XML notation
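One way to picture this "future DAI" step is a single master integration service running a script of queries against several underlying sources. The sketch below is purely illustrative; the class names and behaviour are assumptions, not a proposed OGSA-DAI interface.

# Illustrative only: a "master" integration service fanning a script of queries
# out to two underlying sources (Sx: XML, Sy: relational).

class StubGDS:
    def __init__(self, kind):
        self.kind = kind                       # "xml" or "relational"
    def perform(self, query):
        return f"<results source='{self.kind}' query={query!r}/>"

class IntegrationMaster:
    def __init__(self, sources):
        self.sources = sources                 # e.g. {"Sx": ..., "Sy": ...}
    def run_script(self, steps):
        # Each step names a source and a query; result sets come back in order (3a/3c).
        return [self.sources[name].perform(q) for name, q in steps]

master = IntegrationMaster({"Sx": StubGDS("xml"), "Sy": StubGDS("relational")})
for result in master.run_script([("Sx", "//gene[@id='x']"),
                                 ("Sy", "SELECT * FROM pathways WHERE gene = 'y'")]):
    print(result)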
Integration is our Focus
Supporting Collaboration
Bring together disciplines
Bring together people engaged in shared challenge
Inject initial energy
Invent methods that work
Supporting Collaborative Research
Integrate compute, storage and communications
Deliver and sustain integrated software stack
Operate dependable infrastructure service
Integrate multiple data sources
Integrate data and computation
Integrate experiment with simulation
Integrate visualisation and analysis
High-level tools and automation essential
Fundamental research as a foundation
Take Home Message
  Data is a major source of challenges AND an enabler of:
    New science, engineering, medicine, planning, …
    Information Grids
    Support for collaboration
    Support for computation and data grids
  Structured data is fundamental
  Integrated strategies & technologies needed
  e-Infrastructure is here – more to do, technically & socio-economically
  Join in – explore the potential – develop the methods & standards
  NeSC would like to help you develop e-Science
  We seek suggestions and collaboration