UK e-Science: data, data everywhere, ... Sunlabs, Boston Malcolm Atkinson

UK e-Science: data,
data everywhere, ...
Sunlabs, Boston
Malcolm Atkinson
Director
www.nesc.ac.uk
7th October 2005
Data, Data Everywhere & …
Growing volumes
Growing diversity
⇒ Rich resource
Growing complexity
How do we mine its riches for nuggets of information?
Find & Access
Understand
Extract, Combine & Digest
⇒ Use OGSA-DAI
Test hypotheses
Bingo!
OGSA-DAI
Story Line
The UK e-Science Context
Example Projects
Their demand for Data Integration
OGSA-DAI
An Extensible Framework
A kit of parts
A way of life
Summary & Take Home Messages
What is e-Science?
Goal: to enable better research in all
disciplines
What is e-Science?
Method: Invention and exploitation of advanced computational methods
• to generate, curate and analyse research data
– From experiments, observations and simulations
– Quality management, preservation and reliable evidence
• to develop and explore models and simulations
– Computation and data at extreme scales
– Trustworthy, economic, timely and relevant results
• to enable dynamic distributed virtual organisations
– Facilitating collaboration with information and resource sharing
– Security, reliability, accountability, manageability and agility
Why use / build Grids?
Research Arguments
Enables new ways of working
New distributed & collaborative research
Unprecedented scale and resources
Economic Arguments
Reduced system management costs
Shared resources ⇒ better utilisation
Pooled resources ⇒ increased capacity
Load sharing & utility computing
Cheaper disaster recovery
Grids: a foundation for e-Science
e-Science methodologies will rapidly transform science, engineering, medicine and business
• driven by exponential growth (×1000/decade)
• enabling a whole-system approach
[Diagram derived from Ian Foster’s slide: the Grid linking computers, software, sensor nets, instruments, colleagues and shared data archives]
Why use / build Grids?
Computer Science Arguments
New attempt at an old hard problem
Frustrating ignorance of existing results
New scale, new dynamics, new scope
Engineering Arguments
Enable autonomous organisations to
– Write complementary software components
– Set up, run & use complementary services
– Share operational responsibility
Why use / build Grids?
Political & Management Arguments
Stimulate innovation
Promote intra-organisation collaboration
Promote inter-enterprise collaboration
What is e-Infrastructure – Political view
• A shared resource
– That enables science, research, engineering, medicine, industry, …
– It will improve UK / European / … productivity
– Lisbon Accord 2000
– E-Science Vision SR2000 – John Taylor
• Commitment by UK government
– Sections 2.23-2.25
• Always there
– c.f. telephones, transport, power
UK e-Science Budget (2001-2006)
Total: £213M + £100M via JISC
+ Industrial Contributions £25M

By research council:
EPSRC   £77.7M   37%
PPARC   £57.6M   27%
MRC     £21.1M   10%
BBSRC   £18M      8%
NERC    £15M      7%
ESRC    £13.6M    6%
CLRC    £10M      5%

EPSRC Breakdown:
Applied  £35M     45%
Core     £31.2M   40%
HPC      £11.5M   15%

Staff costs. Grid resources (computers & network) funded separately.
Source: Science Budget 2003/4 – 2005/6, DTI(OST)
Slide from Steve Newhouse
The e-Science Centres
[Map of centres; labels recovered:]
• Globus Alliance
• National Centre for e-Social Science
• Open Middleware Infrastructure Institute
• e-Science Institute
• Grid Operations Support Centre
• Digital Curation Centre
• CeSC (Cambridge)
• OMII-UK
• EGEE
• National Institute for Environmental e-Science
The Primary Requirement …
Building people grids
Enabling People to Work Together on Challenging Projects: Science, Engineering & Medicine
Service-Oriented Architecture
Registry
Discovery
Registration
Invocation
Client
Service
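The register / discover / invoke triangle on this slide can be sketched in a few lines. This is a minimal in-process illustration, not a web-service stack: the `Registry` class, the `sequence-lookup` capability name and the `sequence_lookup` service are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Service = Callable[[str], str]


@dataclass
class Registry:
    """Maps a capability name to the services that registered it."""
    _entries: Dict[str, List[Service]] = field(default_factory=dict)

    def register(self, capability: str, service: Service) -> None:
        # Registration: a provider advertises what it can do.
        self._entries.setdefault(capability, []).append(service)

    def discover(self, capability: str) -> List[Service]:
        # Discovery: a client asks who offers a capability.
        return list(self._entries.get(capability, []))


def sequence_lookup(accession: str) -> str:
    """Hypothetical service: stands in for a remote lookup."""
    return f"record for {accession}"


registry = Registry()
registry.register("sequence-lookup", sequence_lookup)

# Invocation: the client talks to the service directly once it is found.
services = registry.discover("sequence-lookup")
result = services[0]("P12345")
print(result)
```

The client never hard-codes which service it uses; it only knows the capability name, which is what allows providers to be added or replaced independently.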
Story so Far
UK commits to invest
To stimulate academia & industry
Differentiated by Coordination
Strong leadership from Tony Hey
Initially committed to US Grid M/W
Invested in Web Services
Focus on Data & Information Grids
DAI
Semantic Grids
Application driven development
Story Line
The UK e-Science Context
Example Projects
Their demand for Data Integration
OGSA-DAI
An Extensible Framework
A kit of parts
A way of life
Summary & Take Home Messages
Motivation
• Entering an age of data
– Data Explosion
– CERN: LHC will generate 1GB/s = 10PB/y
– VLBA (NRAO) generates 1GB/s today
– Pixar generates 100 TB/Movie
– Storage getting cheaper
• Data stored in many different ways
– Data resources
– Relational databases
– XML databases
– Flat files
• Need ways to facilitate
– Data discovery
– Data access
– Data integration
• Empower e-Business and e-Science
– The Grid is a vehicle for achieving this
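A quick check of the "1GB/s = 10PB/y" figure: the equivalence works out if data is recorded for roughly 10^7 seconds a year. That running time is an assumption introduced here to make the arithmetic explicit; it is not stated on the slide. Decimal units (1 GB = 10^9 bytes) are also assumed.

```python
GB = 10**9   # bytes, decimal units (assumption)
PB = 10**15  # bytes

rate_bytes_per_s = 1 * GB

# Assumption: about 10^7 seconds of data taking per year.
seconds_running = 10**7
# For comparison: recording around the clock for a full calendar year.
seconds_calendar = 365 * 24 * 3600

pb_per_year_running = rate_bytes_per_s * seconds_running / PB
pb_per_year_calendar = rate_bytes_per_s * seconds_calendar / PB

print(pb_per_year_running)   # volume under the 10^7 s assumption
print(pb_per_year_calendar)  # volume if recording never stopped
```

Under the running-time assumption the volume is 10 PB a year; recording continuously would give about three times as much.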
Composing Observations in Astronomy
No. & sizes of data sets as of mid-2002,
grouped by wavelength
• 12 waveband coverage of large
areas of the sky
• Total about 200 TB data
• Doubling every 12 months
• Largest catalogues near 1B objects
Data and images courtesy Alex Szalay, Johns Hopkins
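"Doubling every 12 months" compounds quickly: ten doublings is a factor of 1024, which is roughly the ×1000/decade growth quoted earlier in the talk. A small sketch of the projection, assuming the 200 TB starting point and a constant doubling period (real growth will not be this regular; `projected_tb` is a name introduced here):

```python
def projected_tb(start_tb: float, years: float, doubling_months: float = 12) -> float:
    """Volume after `years`, assuming one doubling every `doubling_months`."""
    doublings = years * 12 / doubling_months
    return start_tb * 2 ** doublings


# 200 TB in mid-2002, doubling yearly, projected 1, 5 and 10 years ahead.
for years in (1, 5, 10):
    print(years, projected_tb(200, years))
```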
Biomedical data – making
connections
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg
Slide provided by Carole Goble: University of Manchester
Database Growth
PDB Content Growth
Slide provided by Richard Baldock: MRC HGU Edinburgh
BRIDGES
[Diagram: the CFG Virtual Organisation, with VO Authorisation, spans Glasgow, Edinburgh, Leicester, Oxford, London and the Netherlands, each holding private data. A Data Hub, built with OGSA-DAI and an Information Integrator, draws on publicly curated data (Ensembl, OMIM, SWISS-PROT, MGI, RGD, HUGO, …). Other labels: blast, Synteny Grid Service.]
Slide provided by Richard Sinnott: University of Glasgow
eDiaMoND: Screening for
Breast Cancer
Patients
Screening
1 Trust → Many Trusts
Collaborative Working
Audit capability
Epidemiology
Electronic
Patient Records
Radiology reporting
systems
Letters
Case Information
Assessment/ Symptomatic
Biopsy
2ndary Capture
Or FFD
X-Rays and
Case Information
Other Modalities
-MRI
-PET
-Ultrasound
Symptomatic/Assessment
Information
eDiaMoND
Grid
Case and
Reading Information
Better access to
Case information
And digital tools
SMF
CAD
Training
Case and
Reading Information
Digital
Reading
3D Images
Supplement Mentoring
With access to digital
Training cases and sharing
Of information across
clinics
Manage Training Cases
Perform Training
SMF
CAD
Temporal Comparison
Provided by eDiamond project: Prof. Sir Mike Brady et al.
Business Intelligence and Customer Data
Customer Data
Customer Order Information
SAP
Oracle
DB2
Siebel
Data Grid
Customer
Support
Web-based
Dashboard
• Marketing department identifies likely buyers of new product
• Company wants real-time integrated view of customer buying behavior
• Data resides in various distributed CRM & ERP systems
• Grid allows developers and apps to access and integrate customer data sources in real time, across many distributed databases
Slide from: Dave Berry, Andrew Grimshaw – OGSA-WG
Providing Data to
Cluster-Based Analytical Application
R&D
West Coast
Testing
India
Engineering
East Coast
Data
Grid
Headquarters
Illinois
Forward Proxy
Data Caches of
Remote Data
Centralized
Compute Cluster
Data
Grid
Data
Grid
Analytical
Applications
Data
Grid
• Company has centralized HPC cluster
running compute-intensive applications
• Source data for analysis distributed
among 3 global sites, one of them an
external partner
• Manual data-sharing processes increase costs/errors, and hinder time-to-results
• Grid enables secure, automatic provisioning of remote data to HPC cluster, feeding CPUs more data faster
Slide from: Dave Berry, Andrew Grimshaw – OGSA-WG
Story so far
• There is a lot of data
– growing in every dimension
– Distributed
– Many different producers & owners
– Heterogeneous
– High value resource
– Combined it is more valuable
• There are many requirements for data integration
– Takes many forms
– Driven by insights
– Enable conversion of insight into tested hypothesis
• There are many data owners
– Their autonomy and policies must be respected
[Diagonal overlay, partly recoverable: Generic Repeatable Solution Required]
Grid Construction Principles
Dynamic coupling based on SOA
Respect Autonomy – local policies
Independent construction & provision
Requires adherence to agreed protocols
Interoperability
Preferably with widely adopted standards
Algorithms & Information structures
Must scale well
Must be tolerant to partial failure
Must be tolerant to partial system change
Mechanisms to build trust are essential
Much more than just providing services
Providing mutually consistent services with compatible policies
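"Must be tolerant to partial failure" has a concrete meaning for a client that queries several autonomous sites: one site being down should not cost the answers from the others. A minimal sketch, assuming three hypothetical sources (`edinburgh`, `glasgow`, `oxford`) that are plain Python callables standing in for remote data services:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Source = Callable[[], List[str]]


def query_all(
    sources: Dict[str, Source], timeout_s: float = 5.0
) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Query every source; return (results, failures) instead of raising."""
    results: Dict[str, List[str]] = {}
    failures: Dict[str, str] = {}
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in sources.items()}
        for name, future in futures.items():
            try:
                # The timeout limits how long we wait on each result.
                results[name] = future.result(timeout=timeout_s)
            except Exception as exc:
                # Record the failure and carry on with the other sources.
                failures[name] = str(exc)
    return results, failures


# Hypothetical sources; one of them is unavailable.
def edinburgh() -> List[str]:
    return ["gene-17", "gene-42"]


def glasgow() -> List[str]:
    raise ConnectionError("site unavailable")


def oxford() -> List[str]:
    return ["gene-42"]


results, failures = query_all(
    {"edinburgh": edinburgh, "glasgow": glasgow, "oxford": oxford}
)
print(results)
print(failures)
```

Returning the failures alongside the results lets the caller decide whether a partial answer is good enough, which is a policy choice the infrastructure should not make on the user's behalf.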
Story Line
The UK e-Science Context
Example Projects
Their demand for Data Integration
OGSA-DAI
Architectural Context
An Extensible Framework
A kit of parts
A way of life
Summary & Take Home Messages
Three communities
Users: Individual & Organisations
Data, Information & Knowledge Providers
Compute, Storage & Communications Providers
OGSA-DAI
Three communities
Users: Individual & Organisations
– Initiate & Steer work
– Provide Requirements
– Use knowledge & insight
– Want easy & reliable facility
– Expect agility, reliability, stability & latest techniques
– Pay the bills
Data, Information & Knowledge Providers
Compute, Storage & Communications Providers
OGSA-DAI
Three communities
Users: Individual & Organisations
Data, Information & Knowledge Providers
Compute, Storage & Communications Providers
– Provide & operate resources
– Storage centres, Data centres, DBMS & File Systems
– Computation environment, Processing & Communications
– Need to change facilities & policies
– Prefer consolidated requirements
– Must be paid
OGSA-DAI
Three communities
Users: Individual & Organisations
Data, Information & Knowledge Providers
– Create & Collect data
– Organise & Structure data
– Provide, organise & maintain metadata
– Offer access and domain specific services
– Establish use policies
– Will change structure, services & policies
– May pay or be paid
Compute, Storage & Communications Providers
OGSA-DAI
Three communities
Users: Individual & Organisations
Data, Information & Knowledge Providers
Compute, Storage & Communications Providers
OGSA-DAI: a framework for a lasting & productive relationship
OGSA-DAI In One Slide
• An extensible framework for data
access and integration.
• Expose heterogeneous data
resources to a grid through web
services.
• Interact with data resources:
– Queries and updates.
– Data transformation / compression
– Data delivery.
• Customise for your project using
– Additional Activities
– Client Toolkit APIs
– Data Resource handlers
• A base for higher-level services
– federation, mining, visualisation,…
Slide from Neil Chue Hong
Connecting three communities
Users: Individual & Organisations
Portals & Applications
Client Library
OGSA-DAI
Data, Information & Knowledge Providers
Compute, Storage & Communications Providers
[Diagram labels recovered from rotated text: Adaptive Integrated Interfaces; Adaptive & Extensible Interfaces; Provider Interfaces & Services]
International Collaboration & Use
USA:
o Globus Alliance
o IBM Corporation
o caBIG
o BIRN
o Indiana University
o GridSphere
o GEON
o LEAD
o MCS
o NCSA
o Secure Data Grid
o UNC
Tutorials
Boston
CERN
Edinburgh
San Francisco
Seoul
Tokyo
UK:
o OMII
o OMII-UK
o NGS
o NCeSS
Europe:
o NIeeS
o CERN
o AstroGrid
o DataMiningGrid
o BioSimGrid
o GridMiner
o BRIDGES
o GridSphere
o CancerGrid
o inteligrid
o ConvertGrid
o N2Grid
o eDiaMonD
o OntoGrid
o EDINA
o Provenance
o First Group plc
o SIMDAT
o Fujitsu Labs Europe
o OMII-EU
o GEDDM
o GeneGrid
o Genomic Technology and Informatics
o GOLD
o Human Genetics Unit
Austria
o IBM UK
1%
o myGrid
France
o Oracle UK
3%
Cambridge
Chicago
London
Seattle
Singapore
ISSGC 03 to 05
China:
Japan:
o CAS
o AIST
o ChinaGrid
o BioGrid
o cnGrid
o NAREGI
o INWA
o OMII-China
South Korea:
o KISTI
Others
20%
Australia:
o Curtin Business School
o INWA
China
40%
Italy
2%
Japan
5%
United Kingdom
DIALOGUE
workshops
15%
Columbus, Edinburgh, Indiana, Vienna
1485 registered
users
United States
Chicago, Manchester,
San Diego
11%
Germany
3%
5250+ downloads
Meeting User Requirements
FirstDIG
ConvertGrid
eDiaMoND
GeneGrid
BRIDGES
LEAD
Grid Miner
OGSA WebDB
OGSA-DQP
caBIG
OGSA-DAI team
EPCC Team, Edinburgh NeSC, Edinburgh
NEReSC, Newcastle
ESNW, Manchester
OGSA-DAI
IBM Development Team, Hursley IBM Dissemination Team
Inside a Data Service (1/2)
Perform
document
Data Service
Response
document
• Extensible document-based interface:
– Change in behaviour ≠> interface change.
– Reduce number of operation calls.
• Perform document:
– Data resource-related activities – query, update, create, compress,
transform, deliver.
– Data flows between activities.
• Response documents:
– Status of executed activities, result data.
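The document-based interface can be illustrated with a toy model: a perform document names a list of activities and how data flows between them, and a single `perform` call returns a response document with per-activity status and the result data. The sketch below uses plain Python dictionaries; the field names (`name`, `type`, `from`) and the three activity implementations are hypothetical and do not reproduce the actual OGSA-DAI XML schema.

```python
from typing import Any, Callable, Dict, List

# Hypothetical activity implementations, keyed by activity type.
# Each receives its own spec and the named outputs produced so far.
ACTIVITIES: Dict[str, Callable[[dict, Dict[str, Any]], Any]] = {
    # Stands in for a database query; returns fixed rows.
    "query": lambda spec, streams: [("alice", 3), ("bob", 5)],
    "transform": lambda spec, streams: [f"{n},{v}" for n, v in streams[spec["from"]]],
    "deliver": lambda spec, streams: "\n".join(streams[spec["from"]]),
}


def perform(document: List[dict]) -> dict:
    """Run the activities in order and build a response document."""
    streams: Dict[str, Any] = {}
    status: Dict[str, str] = {}
    for spec in document:
        try:
            streams[spec["name"]] = ACTIVITIES[spec["type"]](spec, streams)
            status[spec["name"]] = "COMPLETED"
        except Exception as exc:
            status[spec["name"]] = f"ERROR: {exc!r}"
            break  # later activities depend on this one
    last = document[-1]["name"] if document else None
    return {"status": status, "result": streams.get(last)}


perform_document = [
    {"name": "q", "type": "query", "expression": "SELECT name, visits FROM t"},
    {"name": "csv", "type": "transform", "from": "q"},
    {"name": "out", "type": "deliver", "from": "csv"},
]

response = perform(perform_document)
print(response["status"])
print(response["result"])
```

Adding a new kind of activity means adding an entry to `ACTIVITIES`; the signature of `perform` stays the same, which is the point of "change in behaviour ≠> interface change". The whole workflow also travels in one call rather than three.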
DAIS Data Access
Database
Data Service
Consumer
SQLDescription:
Readable
Writeable
ConcurrentAccess
TransactionInitiation
TransactionIsolation
Etc.
SQLExecute ( SQLExpression )
Relational
Database
SQLAccess
SQLResponse
OGSA-DAI
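The direct access pattern in the diagram (a consumer sends an SQL expression and gets the response back in the same exchange, while descriptive properties say what the resource allows) can be sketched with the standard `sqlite3` module. The class and method names below echo the slide labels but are hypothetical; this is not the DAIS interface definition, and only two of the listed properties are modelled.

```python
import sqlite3
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class SQLDescription:
    """A subset of the descriptive properties shown on the slide."""
    readable: bool
    writeable: bool


class SQLAccessService:
    """Hypothetical stand-in for a database data service."""

    def __init__(self, conn: sqlite3.Connection, description: SQLDescription) -> None:
        self._conn = conn
        self.description = description

    def sql_execute(self, expression: str, params: Sequence = ()) -> List[Tuple]:
        # Crude statement classification, adequate for a sketch only.
        is_query = expression.lstrip().upper().startswith("SELECT")
        if is_query and not self.description.readable:
            raise PermissionError("resource is not readable")
        if not is_query and not self.description.writeable:
            raise PermissionError("resource is not writeable")
        return self._conn.execute(expression, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (symbol TEXT, chromosome INTEGER)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?)",
    [("BRCA1", 17), ("TP53", 17), ("CFTR", 7)],
)

service = SQLAccessService(conn, SQLDescription(readable=True, writeable=False))
rows = service.sql_execute(
    "SELECT symbol FROM gene WHERE chromosome = ? ORDER BY symbol", (17,)
)
print(rows)
```

A consumer can inspect `service.description` before sending a request, instead of discovering by trial and error that the resource is read-only.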
DAIS Derived Data Access
Consumer
Database Data Service
SQLExecuteFactory ( SQLExpression, BehaviouralProperties )
SQLDescription:
Readable
Writeable
ConcurrentAccess
TransactionInitiation
TransactionIsolation
Etc.
Relational Database
SQLFactory
Reference to SQL Response Data Service
RDBMS specific mechanism for generating result set
SQL Response Data Service
GetRowset ( rowsetnumber )
SQLResponseDescription
RowSet
SQLResponseAccess
Rowset
OGSA-DAI
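In the derived access pattern the consumer does not receive the rows directly. The factory call returns a reference to a new response data service, and the consumer (or a third party holding the reference) then pulls rowsets from it by number. A minimal sketch with `sqlite3`: the names mirror the slide labels but are hypothetical, and the result is materialised eagerly here, whereas the slide leaves result generation to an RDBMS-specific mechanism.

```python
import sqlite3
from typing import Dict, List, Tuple


class SQLResponseService:
    """Holds one query result and serves it in fixed-size rowsets."""

    def __init__(self, rows: List[Tuple], rowset_size: int) -> None:
        self._rows = rows
        self._size = rowset_size

    def get_rowset(self, rowset_number: int) -> List[Tuple]:
        if rowset_number < 0:
            raise ValueError("rowset_number must be non-negative")
        start = rowset_number * self._size
        return self._rows[start:start + self._size]


class SQLFactoryService:
    """Runs a query and hands back a reference to its response service."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._responses: Dict[str, SQLResponseService] = {}

    def sql_execute_factory(self, expression: str, rowset_size: int = 2) -> str:
        # Simplification: fetch everything now rather than lazily.
        rows = self._conn.execute(expression).fetchall()
        reference = f"response-{len(self._responses)}"
        self._responses[reference] = SQLResponseService(rows, rowset_size)
        return reference

    def resolve(self, reference: str) -> SQLResponseService:
        return self._responses[reference]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (id INTEGER)")
conn.executemany("INSERT INTO obs VALUES (?)", [(i,) for i in range(1, 6)])

factory = SQLFactoryService(conn)
ref = factory.sql_execute_factory("SELECT id FROM obs ORDER BY id")
response = factory.resolve(ref)
first = response.get_rowset(0)
last = response.get_rowset(2)
print(ref, first, last)
```

Because only a reference changes hands, large results can stay near the database and be consumed incrementally.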
Inside a Data Service (2/2)
Perform
document
Query
Activity
query
Response
document
Engine
data
data
Transform
Activity
Delivery
Activity
data
credentials
connection
credentials
connection
role
DataResource
Mediator
role
Role Mapper
OGSA-DAI
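The internals shown in the diagram can be read as two steps: the role mapper turns the caller's credentials into a database role and hence a connection, and the engine then streams data through the query, transform and delivery activities. The sketch below models that with generators and `sqlite3`. Every name in it (`ROLE_MAP`, `DataResource`, `engine`, the `CN=...` credential strings) is hypothetical, and real credential handling would use certificates rather than strings.

```python
import sqlite3
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical mapping from grid credentials to database roles.
ROLE_MAP: Dict[str, str] = {"CN=Alice": "reader"}


def map_role(credentials: str) -> str:
    try:
        return ROLE_MAP[credentials]
    except KeyError:
        raise PermissionError(f"no role for {credentials}") from None


class DataResource:
    """Owns the database and hands out connections per role."""

    def __init__(self) -> None:
        self._db = sqlite3.connect(":memory:")
        self._db.execute("CREATE TABLE obs (name TEXT, flux REAL)")
        self._db.executemany("INSERT INTO obs VALUES (?, ?)", [("a", 1.5), ("b", 2.5)])

    def connection_for(self, role: str) -> sqlite3.Connection:
        if role != "reader":
            raise PermissionError(f"role {role} may not connect")
        return self._db


def query_activity(conn: sqlite3.Connection, sql: str) -> Iterator[Tuple]:
    yield from conn.execute(sql)


def transform_activity(rows: Iterable[Tuple]) -> Iterator[str]:
    for name, flux in rows:
        yield f"{name},{flux}"


def delivery_activity(lines: Iterable[str]) -> str:
    return "\n".join(lines)


def engine(credentials: str, sql: str, resource: DataResource) -> str:
    role = map_role(credentials)          # credentials -> role
    conn = resource.connection_for(role)  # role -> connection
    # Data flows activity to activity without intermediate copies.
    return delivery_activity(transform_activity(query_activity(conn, sql)))


resource = DataResource()
output = engine("CN=Alice", "SELECT name, flux FROM obs ORDER BY name", resource)
print(output)
```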
OGSA-DAI Deck of Activities
OGSA-DAI
Slide from Mario Antonioletti
OGSA-DAI Architecture
Tx
Request
External Standard Coordination
Tx
Response
Disciplines organise specialised extension
Users: Individual & Organisations
Portals & Applications
– Medical image specialisation
– Astro specialisation
– …
Client Library
OGSA-DAI
Activities
Data, Information & Knowledge Providers
Compute, Storage & Communications Providers
[Diagram labels recovered from rotated text: Adaptive Integrated Interfaces; Adaptive & Extensible Interfaces; Provider Interfaces & Services]
Providing stability
Users: Individual & Organisations
Portals & Applications
Client Library
OGSA-DAI
Activities
Data, Information & Knowledge Providers
Compute, Storage & Communications Providers
• Stable interfaces for users
• Stable interfaces for application developers
• Stable interfaces for activity developers
• Stable interfaces for data providers
• Stable interfaces for resource providers
[Diagram labels recovered from rotated text: Adaptive Integrated Interfaces; Adaptive & Extensible Interfaces; Provider Interfaces & Services]
Extending capability & interfaces
Users: Individual & Organisations
Portals & Applications
Client Library
OGSA-DAI
Activities
Data, Information & Knowledge Providers
Compute, Storage & Communications Providers
Annotations:
• Add new operations, more abstraction & composition tools
• Add new functions & higher abstraction
• Exploit new capability offered by providers
• Expose new capability offered by data providers
• Always preserving old interfaces for reasonable negotiated periods, exposed to data, developer & user communities
Path towards improvement
Users: Individual & Organisations
Portals & Applications
Client Library
OGSA-DAI
Activities
Data, Information & Knowledge Providers
Compute, Storage & Communications Providers
Annotations:
• Drive abstraction & composition from metadata
• Exploit metadata
• Automate adaptation
• Optimise parallel, resilient & pipelined evaluators
• Control system & extension behaviour
• Negotiate SLAs with providers & obtain better description of resources
• Automate adaptation to sustain delivered functions & extension
• Improve metadata offered by data providers
Story Line
The UK e-Science Context
Example Projects
Their demand for Data Integration
OGSA-DAI
Architectural Context
An Extensible Framework
A kit of parts
A way of life
Summary & Take Home Messages
Challenges versus Goals
Challenge | Goal
Heterogeneity & Variety | Simple operational model
Complex platform behaviour | Simple application model
Partial failures | Simple user model
Partial failures + large tasks | Minimal resource wastage
Autonomy – owner’s rights | Stability & uniformity
Independent provision | Simple resource access
Scale, costs & latency | Good performance
Vulnerable to misuse | Dependable protection
Diverse & Evolving | Flexible & agile
Valuable assets (reputation, equipment, teams, data, algorithms, working practices) | IPR & assets well protected
Data Services: challenges
• Scale
– Many sites, large collections, many uses
• Longevity
– Research requirements outlive technical decisions
• Diversity
– No “one size fits all” solutions will work
– Primary Data, Data Products, Meta Data, Administrative data, …
• Many Data Resources
– Independently owned & managed
  – No common goals
  – No common design
  – Work hard for agreements on foundation types and ontologies
  – Autonomous decisions change data, structure, policy, …
– Geographically distributed
• and I haven’t even mentioned security yet!
Slide from Neil Chue Hong
Core features of OGSA-DAI – I
• A framework for building applications
– Supports data access, insert and update
  – Relational: MySQL, Oracle, DB2, SQL Server, Postgres
  – XML: Xindice, eXist
  – Files: CSV, BinX, EMBL, OMIM, SWISSPROT,…
– Supports data delivery
  – SOAP over HTTP
  – FTP; GridFTP
  – E-mail
  – Inter-service
– Supports data transformation
  – XSLT
  – ZIP; GZIP
– Supports security
  – X.509 certificate based security
Slide from Neil Chue Hong
Core features of OGSA-DAI – II
• A framework for building data clients
– Client toolkit library for application developers
• A framework for developing functionality
– Extend existing activities, or implement your own
– Mix and match activities to provide functionality you need
• Highly-extensible
– Customise our out-of-the-box product
– Provide your own services, client-side support and data-related functionality
• Comprehensive documentation and tutorials
• Latest release supports GT3.2 (to be deprecated), GT4.0, and Axis 1.2 /
OMII_2 using Java 1.4
Slide from Neil Chue Hong
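"Extend existing activities, or implement your own" and "mix and match" suggest a plug-in pattern: activities register themselves under a name, and a request simply lists the names to chain. A minimal sketch of that idea, assuming activities are byte-to-byte functions. The registry, the decorator and the `upper` activity are hypothetical, and the real framework is Java based, so this shows the pattern rather than its API. GZIP compression, one of the transformations listed earlier, serves as the example.

```python
import gzip
from typing import Callable, Dict, Iterable

Activity = Callable[[bytes], bytes]

# Hypothetical registry of named activities.
_REGISTRY: Dict[str, Activity] = {}


def activity(name: str) -> Callable[[Activity], Activity]:
    """Decorator that registers a function as a named activity."""
    def decorator(fn: Activity) -> Activity:
        _REGISTRY[name] = fn
        return fn
    return decorator


@activity("gzip")
def gzip_activity(data: bytes) -> bytes:
    return gzip.compress(data)


# A project-specific extension: added without touching run() or gzip.
@activity("upper")
def upper_activity(data: bytes) -> bytes:
    return data.upper()


def run(names: Iterable[str], data: bytes) -> bytes:
    """Chain the named activities, feeding each output to the next."""
    for name in names:
        data = _REGISTRY[name](data)
    return data


packed = run(["upper", "gzip"], b"id,name\n1,ensembl\n")
print(gzip.decompress(packed))
```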
E-Infrastructure: the path to Nirvana 1
Users: Individual & Organisations
User Adaptation Layer
Integration, Sharing, Trust building, Resource Managing Distributed System Infrastructure
Data, Information & Knowledge Providers
Compute, Storage & Communications Providers
Annotations:
• Stability, extended with improved abstractions
• New technology
• Discipline-specific toolsets, languages & portals
• User mobility
E-Infrastructure: the path to Nirvana 2
Users: Individual & Organisations
User Adaptation Layer
Integration, Sharing, Trust building, Resource Managing Distributed System Infrastructure
Data, Information & Knowledge Providers
Compute, Storage & Communications Providers
Annotations:
• Stability + higher-order operations + improved (semantic) metadata
E-Infrastructure: the path to Nirvana 3
Users: Individual & Organisations
User Adaptation Layer
Integration, Sharing, Trust building, Resource Managing Distributed System Infrastructure
Data, Information & Knowledge Providers
Compute, Storage & Communications Providers
Annotations:
• Increasing self-organisation & autonomic behaviour
• Automated generation of stability adapters & translators
Summary: Take home message
E-Infrastructure is arriving
Built on Grids & Web Services
Data and Information grow in importance
There is a dramatic rate of change
An opportunity for everyone
Distributed Systems to Grids
Bespoke pioneering distributed systems, e.g. SABRE & SAGE
Licklider proposes shared multi-site computing
ARPA net
IBM CICS
Ethernet
TCP
Dozens of academic networks
Two dominant protocols
CORBA & DCOM
IP-based Internet
Academic & Research
WWW
1960
1970
1980
1990
2000
Distributed Systems to Grids
Bespoke pioneering distributed systems
Licklider proposes shared multi-site computing
ARPA net
IBM CICS
Dozens of academic networks
Two dominant protocols
CORBA & DCOM
IP-based Internet
Academic & Research
WWW
I-way
Condor
Globus Starts
Unicore
Many research grids
Using many protocols & M/W stacks
Web Services
Collaboration via shared bio/chem/medical “DBs”
D-Grid starts
1960
1970
1980
1990
2000